Abstract: The present invention provides a method and apparatus for managing resources in a multithreaded processor. In
one embodiment, a resource is partitioned into a number of portions based upon a number of threads being executed concurrently.
Resource allocation for each thread is performed in its respective portion of the resource.

METHOD AND APPARATUS FOR MANAGING RESOURCES IN A
MULTITHREADED PROCESSOR

FIELD OF THE INVENTION 

The present invention relates generally to the field of multithreaded
processing. More specifically, the present invention relates to a method and apparatus 
for managing resources in a multithreaded processor. 

BACKGROUND OF THE INVENTION 

Various multithreaded processor designs have been considered in recent times
to further improve the performance of processors, especially to provide for a more 
effective utilization of various processor resources. By executing multiple threads in 
parallel, the various processor resources are more fully utilized which in turn enhances 
the overall performance of the processor. For example, if some of the processor 
resources are idle due to a stall condition or other delay associated with the execution 
of a particular thread, these resources can be utilized to process another thread. A stall 
condition or other delay in the processing of a particular thread may happen due to a 
number of events that can occur in the processor pipeline including, for instance, a
cache miss or a branch misprediction. Consequently, without multithreading 
capabilities, various available resources within the processor would have been idle 
due to a long-latency operation, for example, a memory access operation to retrieve 
the necessary data from main memory, that is needed to resolve the cache miss
condition. 

Furthermore, multithreaded programs and applications have become more 
common due to the support provided for multithreading programming by a number of 
popular operating systems such as the Windows NT® and UNIX operating systems.
Multithreaded applications are particularly attractive in the area of multimedia 
processing. 

Multithreaded processors may generally be classified as fine or coarse grained 
designs, based upon the particular thread interleaving or switching scheme employed 
within the respective processor. In general, fine grained multithreaded designs 
support multiple active threads within a processor and typically interleave two 
different threads on a cycle-by-cycle basis. Coarse grained multithreaded designs, on
the other hand, typically interleave the instructions of different threads on the
occurrence of some long-latency event, such as a cache miss. A coarse multithreaded
design is discussed in Eickemeyer, R., Johnson, R. et al., "Evaluation of Multithreaded
Uniprocessors for Commercial Application Environments", The 23rd Annual
International Symposium on Computer Architecture, pp. 203-212, May 1996. The
distinctions between fine and coarse designs are further discussed in Laudon, J.,
Gupta, A., "Architectural and Implementation Tradeoffs in the Design of
Multiple-Context Processors", Multithreaded Computer Architecture: A Summary of the State
of the Art, edited by R. A. Iannucci et al., pp. 167-200, Kluwer Academic Publishers,
Norwell, Massachusetts, 1994.

While multithreaded designs based on interleaved schemes are generally
advantageous over single threaded designs, they still have their own limitations and
shortcomings. In the fine grained multithreaded designs, which interleave two
different threads on a cycle-by-cycle basis, there are limitations on the applications
due to the fact that each thread cannot make progress in every cycle. A thread is
limited to a single instruction in the pipeline to eliminate the possibility of pipeline
dependencies. To tolerate memory latency, a thread is prevented from issuing its next
instruction until the memory operation is completed. However, limiting a thread to a
single instruction in the pipeline causes some constraints. First, a large number of
threads would be needed to fully utilize the processor. Second, the performance of a
single thread is poor because a thread could at best issue a new instruction every
cycle. While coarse grained multithreaded designs have some advantages over the
fine multithreaded designs, they also have their shortcomings. First, the cost of thread
switching is high because the decision to switch is made late in the pipeline, which can
cause partially executed instructions in the pipeline from the switching thread to be
squashed. Second, because of the high cost of thread switching, multiple threads
cannot be used to tolerate short latencies.

SUMMARY OF THE INVENTION

The present invention provides a method and apparatus for managing
resources in a multithreaded processor. In one embodiment, a resource is partitioned
into a number of portions based upon a number of threads being executed
concurrently. Resource allocation for each thread is performed in its respective
portion of the resource. The partitioning is dynamic and changes as the number of
active threads changes. If only one thread is active, then all of the resource is devoted
to that thread. If all threads are active, then the resource is fully partitioned.

BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of the present invention will be more fully
understood by reference to the accompanying drawings, in which:

Figure 1 is a block diagram of one embodiment of a processor pipeline; 

Figure 2 shows a block diagram of one embodiment of a processor
architecture in which the teachings of the present invention are implemented;

Figure 3 shows a block diagram of one embodiment of a processor unit that
implements the teachings of the present invention;

Figure 4a illustrates a structure of one embodiment of a circular queue in a
single threading mode;

Figure 4b illustrates a structure of one embodiment of a circular queue in a
multithreading mode;

Figure 5 is a high level flow diagram of one embodiment of a method for
managing resources in a multithreaded processor; 

Figure 6 shows a high level flow chart of one embodiment of a method for 
performing resource allocation in a multithreaded processor; 

Figure 7 is a flow diagram of one embodiment of a method for performing
resource allocation between two threads in a multithreaded processor;

Figure 8 is a flow chart of another embodiment of a method for performing
resource allocation between two threads in a multithreaded processor; 

Figure 9 is a flow chart of another embodiment of a method for performing 
resource allocation between two threads in a multithreaded processor; 

Figure 10 illustrates a high level flow chart of one embodiment of a method 
for performing resource allocation in a multithreaded processor running in a single 
threading mode; 

Figure 11 is a flow diagram of one embodiment of a method for performing 
resource allocation for two threads in a parallel structure; 

Figure 12 is a flow diagram of one embodiment of a method for performing
resource allocation for two threads in a multiplexing manner;

Figure 13 illustrates a flow diagram of one embodiment of a method for
performing stall computation for two threads in parallel and resource allocation in
a multiplexed manner;

Figure 14 is a detailed flow diagram for performing resource allocation for
one of the two threads;

Figure 15 is a detailed flow diagram for performing resource allocation for the
other of the two threads;

Figure 16 is a block diagram of one embodiment of an apparatus for
performing stall computation and generating a stall signal;

Figure 17 is a block diagram of another embodiment of an apparatus for
updating the value of a stall pointer; and

Figure 18 is a block diagram of one embodiment of an apparatus for updating
an allocation pointer.

DETAILED DESCRIPTION 

In the following detailed description numerous specific details are set forth in
order to provide a thorough understanding of the present invention. However, it will
be appreciated by one skilled in the art that the present invention may be practiced
without these specific details.

In the discussion below, the teachings of the present invention are utilized to
implement a method and an apparatus for managing various processor resources used
for the execution of multiple threads in a multithreaded processor. The various
resources are partitioned according to a partitioning scheme based upon the number of
threads that are executed concurrently. The partitioning is dynamic and changes as the
number of active threads changes. If only one thread is active, then all of the
resources are devoted to that thread. If all threads are active, then the resources are
fully partitioned. For illustrative and explanation purposes only, the present invention
is described with respect to a switching scheme between a single threading
environment and a multithreading environment in which two threads are executed
concurrently. However, the teachings of the present invention should not be limited to
two threads and should be applicable to any multithreading environment in which
more than two threads are executed concurrently. The teachings of the present
invention are equally applicable to any switching scheme between an M-threaded
environment and an N-threaded environment where M can be greater or less than N
(e.g., switching from a 4-threaded environment to a 2-threaded environment,
switching from a 2-threaded environment to a 5-threaded environment, etc.). In one
embodiment, when two threads are executed in a multithreading mode, each resource
is partitioned into two portions. One of the two portions is reserved for the execution
of one thread and the other portion is reserved for the execution of the other thread. If
there are insufficient resources to accommodate the execution of instructions within
one particular thread, then a stall signal is generated to stall further feeding of
instructions from that particular thread down the processor pipeline until enough
resources become available. The teachings of the present invention are applicable to
any multithreaded processor that is designed to process multiple threads (e.g., two or
more threads) concurrently. However, the teachings of the present invention are not
limited to multithreaded processors and can be applied to any processor and/or
machine in which resources are shared between tasks or processes.
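
The dynamic partitioning described above can be illustrated with a minimal sketch. It assumes that the active threads receive equal, contiguous portions of a resource and that any remainder entries are simply left unused; the function name `partition` and its parameters are hypothetical and do not appear in the specification.

```python
def partition(total_entries, active_threads):
    """Split entries [0, total_entries) into equal contiguous portions,
    one portion per active thread (illustrative assumption: equal shares)."""
    if not active_threads:
        return {}
    share = total_entries // len(active_threads)
    # Thread i in the active list gets entries [i * share, (i + 1) * share).
    return {
        thread: range(i * share, (i + 1) * share)
        for i, thread in enumerate(active_threads)
    }


# One active thread: the whole resource is devoted to it.
single = partition(126, [0])
# Two active threads: the resource is fully partitioned into two portions.
dual = partition(126, [0, 1])
```

Calling `partition` again whenever the set of active threads changes models the switch between an M-threaded and an N-threaded environment.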

Figure 1 is a block diagram of one embodiment of a processor pipeline within
which the present invention may be implemented. For the purposes of the present
specification, the term "processor" refers to any machine that is capable of executing a
sequence of instructions and shall be taken to include, but not be limited to, general
purpose microprocessors, special purpose microprocessors, graphics controllers, audio
processors, video processors, multi-media controllers and microcontrollers. The
processor pipeline 100 includes various processing stages beginning with a fetch stage
110. At this stage, instructions are retrieved and fed into the pipeline 100. For
example, a macroinstruction may be retrieved from a cache memory that is integral
within the processor or closely associated therewith, or may be retrieved from an
external memory unit via a system bus. The instructions retrieved at the fetch stage
110 are then delivered to a decode stage 120 where the instructions or
macroinstructions are decoded into microinstructions or micro-operations for
execution by the processor. At an allocate stage 130, processor resources necessary
for the execution of the microinstructions are allocated. The next stage in the pipeline
is a rename stage 140 where references to external or logical registers are converted
into internal or physical register references to eliminate dependencies caused by
register reuse. At a schedule/dispatch stage 150, each microinstruction is scheduled
and dispatched to an execution unit. The microinstructions are then executed at an
execute stage 160. After execution, the microinstructions are then retired at a retire
stage 170.

In one embodiment, the various stages described above can be organized into
three phases. The first phase can be referred to as an in-order front end including the
fetch stage 110, decode stage 120, allocate stage 130, and rename stage 140. During
the in-order front end phase, the instructions proceed through the pipeline 100 in their
original program order. The second phase can be referred to as the out-of-order
execution phase including the schedule/dispatch stage 150 and the execute stage 160.
During this phase, each instruction may be scheduled, dispatched and executed as
soon as its data dependencies are resolved and the execution unit is available,
regardless of its sequential position in the original program. The third phase, referred
to as the in-order retirement phase, includes the retire stage 170, in which
instructions are retired in their original, sequential program order to preserve the
integrity and semantics of the program, and to provide a precise interrupt model.
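
The grouping of the seven pipeline stages into three phases can be written down as a small lookup table. This is only an illustrative sketch: the stage names and reference numerals come from the description of Figure 1, while the `Phase` enumeration, the `PIPELINE` table, and the helper `stages_in` are hypothetical names introduced here.

```python
from enum import Enum


class Phase(Enum):
    IN_ORDER_FRONT_END = "in-order front end"
    OUT_OF_ORDER_EXECUTION = "out-of-order execution"
    IN_ORDER_RETIREMENT = "in-order retirement"


# (stage name, reference numeral in Figure 1, phase), listed in pipeline order.
PIPELINE = [
    ("fetch", 110, Phase.IN_ORDER_FRONT_END),
    ("decode", 120, Phase.IN_ORDER_FRONT_END),
    ("allocate", 130, Phase.IN_ORDER_FRONT_END),
    ("rename", 140, Phase.IN_ORDER_FRONT_END),
    ("schedule/dispatch", 150, Phase.OUT_OF_ORDER_EXECUTION),
    ("execute", 160, Phase.OUT_OF_ORDER_EXECUTION),
    ("retire", 170, Phase.IN_ORDER_RETIREMENT),
]


def stages_in(phase):
    """Return the stage names that belong to the given phase, in pipeline order."""
    return [name for name, _numeral, stage_phase in PIPELINE if stage_phase is phase]
```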

Figure 2 is a block diagram of one embodiment of a processor, in the form of
a general-purpose microprocessor 200, in which the present invention may be
implemented. The microprocessor 200 described below is a multithreaded (MT)
processor and capable of processing multiple instruction threads simultaneously.
However, the teachings of the present invention described below are fully applicable
to other processors that process multiple instruction threads in an interleaved manner
and also to single thread processors which have the capabilities to process multiple
instructions either in parallel or in an interleaved manner. In one embodiment, the
microprocessor 200 may be an Intel Architecture (IA) microprocessor that is capable
of executing an Intel Architecture instruction set.

The microprocessor 200 comprises an in-order front end, an out-of-order
execution core and an in-order retirement back end. The microprocessor 200 includes
a bus interface unit 202 which functions as an interface between the microprocessor
200 and other components (e.g., main memory unit) of a computer system within
which the microprocessor 200 may be implemented. The bus interface unit 202
couples the microprocessor 200 to a processor bus (not shown) via which data and
control information are transferred between the microprocessor 200 and other system
components (not shown). The bus interface unit 202 includes Front Side Bus (FSB)
logic 204 that controls and facilitates communications over the processor bus. The
bus interface unit 202 also includes a bus queue 206 that is used to provide a buffering
function with respect to the communications over the processor bus. The bus
interface unit 202 receives bus requests 208 from a memory execution unit 212. The
bus interface unit 202 also sends snoops or bus returns to the memory execution unit
212.

The memory execution unit 212 is structured and configured to function as a
local memory within the microprocessor 200. The memory execution unit 212
includes a unified data and instruction cache 214, a data Translation Lookaside Buffer
(TLB) 216, and a memory ordering logic 218. The memory execution unit 212
receives instruction fetch requests 220 from a microinstruction translation engine
(MITE) 224 and provides raw instructions 225 to the MITE 224. The MITE 224
decodes the raw instructions 225 received from the memory execution unit 212 into a
corresponding set of microinstructions, also referred to as micro-operations. Decoded
microinstructions 226 are sent by the MITE 224 to a trace delivery engine (TDE) 230.

The TDE 230 functions as a microinstruction cache and is the primary source
of microinstructions for a downstream execution unit 270. The TDE 230 includes a
trace cache 232, a trace branch predictor (BTB) 234, a micro-code sequencer 236, and
a micro-operation (uop) queue 238. By having a microinstruction caching function
within the processor pipeline, the TDE 230 and specifically the trace cache 232 can
leverage the work done by the MITE 224 to provide a relatively high microinstruction
bandwidth. In one embodiment, the trace cache 232 may comprise a 256 entry, 8 way
set associative memory. The term "trace", in one embodiment, refers to a sequence of
microinstructions stored as entries in the trace cache 232 with each entry having
pointers to preceding and proceeding microinstructions in the trace. Therefore, the
trace cache 232 can facilitate high-performance sequencing in that the address of the
next entry to be accessed to obtain a subsequent microinstruction is known before a
current access is completed. The trace cache branch predictor 234 provides local
branch predictions with respect to traces within the trace cache 232. The trace cache
232 and the microcode sequencer 236 provide microinstructions to the micro-op
queue 238.

The microinstructions are then fed from the micro-op queue 238 to a cluster
(also referred to as the Rename, Reservation Station, Replay, and Retirement or
RRRR cluster) 240. The RRRR cluster 240, in one embodiment, is responsible for
controlling the flow of the microinstructions received from the TDE 230 through the
rest of the microprocessor 200. The functions performed by the RRRR cluster 240
include allocating the resources used for the execution of the microinstructions
received from the TDE 230; converting references to external or logical registers into



WO 01/48599 PCT/US00/32241

internal or physical register references; scheduling and dispatching the
microinstructions for execution to an execution unit 270; providing those
microinstructions that need to be re-executed to the execution unit 270; and retiring
those microinstructions that have completed execution and are ready for retirement.
The structure and operation of the RRRR cluster 240 are described
below. In the event that the resources are insufficient or unavailable to process a
microinstruction or a set of microinstructions, the RRRR cluster 240 will assert a stall
signal 282 that is propagated to the TDE 230. The stall signal 282 is then updated and
sent by the TDE 230 to the MITE 224.

The microinstructions that are ready for execution are dispatched from the
RRRR cluster 240 to the execution unit 270. In one embodiment, the execution unit
270 includes a floating point execution engine 274, an integer execution engine 276,
and a level 0 data cache 278. In one embodiment, the microprocessor 200
executes the IA32 instruction set.

Figure 3 shows a block diagram of one embodiment of the RRRR cluster 240
described in Figure 2 above. The RRRR cluster 240 as shown in Figure 3 includes a
register allocation table (RAT) 301, an allocator and free-list manager (ALF) 311, an
instruction queue (IQ) 321, a reorder buffer (ROB) 331, a scheduler and scoreboard
unit (SSU) 341, and a checker and replay unit (CRU) 351.

In the present embodiment, the TDE 230 delivers the microinstructions
(UOPs) to both the ALF 311 and the RAT 301. The ALF 311 is responsible for
allocating most of the resources needed for the execution of the UOPs received from
the TDE 230. The ALF 311 includes a free-list manager structure (FLM) 315 that is
used to maintain a history of register allocation. The RAT (also referred to as register
renamer) 301 renames the logical registers specified in each UOP to the appropriate
physical register pointers to remove the dependencies caused by register reuse. Once
the ALF 311 and the RAT 301 have completed their corresponding functions, the
UOPs are sent to the IQ 321 for temporary holding prior to being dispatched for
execution by the SSU 341. In the embodiment shown in Figure 3, the IQ 321 is
responsible for providing the information about each UOP to the SSU 341 so that the
SSU 341 can dispatch the respective UOP to the proper execution unit based on data
dependency. In one embodiment, the IQ 321 includes a memory instruction address
queue (MIAQ) 323, a general instruction address queue (GIAQ) 325, and an
instruction data queue (IDQ) 327. In one embodiment, the MIAQ 323 and the GIAQ
325 are used to hold and feed certain time-critical information to the SSU 341 as
quickly as needed. The time-critical information includes the UOP's sources and
destinations, UOP latency, etc. Depending on the type of input UOP, the ALF 311
determines whether the MIAQ 323 or the GIAQ 325 will be used to hold the
time-critical information for the respective input UOP. The MIAQ 323 is used for memory
UOPs (i.e., UOPs that require memory access). The GIAQ 325 is used for
non-memory UOPs (i.e., UOPs that do not require memory access). The IDQ 327 is used
to hold the less time-critical information such as the opcode and immediate data.
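
The split of a UOP's information across the three queues can be sketched as follows. This is an illustration only, not the described hardware: the `Uop` record, its field names, and the function `split_for_queues` are hypothetical, and the sketch assumes a single flag is enough to tell memory UOPs from non-memory UOPs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Uop:
    opcode: str
    sources: Tuple[str, ...]
    destination: str
    latency: int
    immediate: Optional[int]
    is_memory: bool  # True if the UOP requires memory access


def split_for_queues(uop):
    """Choose MIAQ or GIAQ for the time-critical fields; the rest goes to the IDQ."""
    address_queue = "MIAQ" if uop.is_memory else "GIAQ"
    # Time-critical information needed quickly by the scheduler (SSU).
    time_critical = {
        "sources": uop.sources,
        "destination": uop.destination,
        "latency": uop.latency,
    }
    # Less time-critical information held in the instruction data queue (IDQ).
    less_critical = {"opcode": uop.opcode, "immediate": uop.immediate}
    return address_queue, time_critical, less_critical
```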

When a UOP's sources are ready and an execution unit is available, the SSU
341 schedules and dispatches the UOP for execution. There are instances when some
UOPs may produce incorrect data, for example due to a level 0 data cache miss. If a
particular UOP produces incorrect data or uses incorrect data in its execution, the
CRU 351 will be informed of the need for this particular UOP to be re-executed or
replayed until the correct results are obtained. The checker of the CRU 351 examines
each UOP after its execution to determine whether the respective UOP needs to be
re-executed. If so, the replay manager of the CRU 351 is responsible for re-dispatching
the respective UOP to the appropriate execution unit for re-execution. If the checker
determines that a particular UOP does not need to be re-executed, that particular UOP
will be sent to the ROB 331 for retirement.
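
The check-and-replay behavior can be pictured as a simple loop. The sketch below is a software analogy under stated assumptions: `execute` and `is_correct` are caller-supplied stand-ins for the execution unit and the checker, and the `max_replays` bound is a hypothetical safeguard that the text does not mention.

```python
def execute_with_replay(uop, execute, is_correct, max_replays=8):
    """Execute a UOP, replaying it until the checker accepts the result.

    Returns the accepted result and the number of replays that were needed.
    """
    replays = 0
    result = execute(uop)
    while not is_correct(uop, result):
        if replays == max_replays:
            # Hypothetical guard so the sketch cannot loop forever.
            raise RuntimeError("replay limit reached")
        replays += 1
        result = execute(uop)  # re-dispatch for re-execution
    return result, replays
```

For example, a load that misses the level 0 data cache twice before the data arrives would be replayed twice and then passed on for retirement.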

The ROB 331 is responsible for retiring each UOP in its original logical
program order once its execution has been completed and it is ready for retirement
(i.e. no replay). In addition, the ROB 331 is responsible for handling internal and
external events. Examples of internal events include exceptions signaled by the write
back of various UOPs such as floating point denormal assist or other events signaled
by UOPs that need to be handled by the microcode (e.g., assists). External events
include interrupts, page fault, SMI requests, etc. In one embodiment, the ROB 331 is
the unit responsible for ensuring that all events are serviced in accordance with the
architectural requirements of the microprocessor. There are several conditions like
events, interrupts, halt, reset, etc. that will cause the machine to change mode or to
switch between MT and ST configuration. Whenever the ROB 331 detects such a
condition, it asserts a signal or a set of signals (referred to as CRNuke herein) which
causes all the UOPs being processed but not retired or committed to be flushed. The
ROB 331 then provides the TDE 230 with the address of the microinstruction from
which to start sequencing UOPs to handle the event. For example, if the memory
cluster detects a page fault exception on a load UOP, it will transmit a signal to the
ROB 331 to alert the ROB 331 of this event. When the ROB 331 reaches this load
UOP, it will assert the signal CRNuke and not commit any state for any of the UOPs
including the load UOP and those following it. The ROB 331 will then send the
appropriate information to the TDE 230 to start sequencing UOPs to service the page
fault exception.
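
The page fault example can be traced with a short sketch of in-order retirement. It is a simplified model under stated assumptions: the ROB contents are a list of `(name, event)` pairs in program order, the `handlers` mapping from event to a microcode start address is hypothetical, and the address value used in any call is made up for illustration.

```python
def retire(rob_entries, handlers):
    """Commit UOPs in program order until one carries an event.

    Returns (committed, flushed, handler_address). On an event, the faulting
    UOP and everything after it are flushed and no state is committed for them.
    """
    committed = []
    for position, (name, event) in enumerate(rob_entries):
        if event is not None:
            # Models asserting CRNuke: flush this UOP and all younger UOPs.
            flushed = [uop_name for uop_name, _ in rob_entries[position:]]
            return committed, flushed, handlers[event]
        committed.append(name)
    return committed, [], None
```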

In one embodiment, the ROB 331 is responsible for detecting and controlling
transitions of the machine from single thread mode to multi-thread mode and back. It
performs its corresponding function by detecting certain events which can be either
internal or external and asserting CRNuke to the rest of the machine and also asserting
signals to communicate the new state of the machine. The rest of the machine reacts
to the CRNuke signal and the new state signals to enter or exit MT mode or ST mode.

In one embodiment, the resources that are allocated by the ALF 311 for the
execution of the incoming UOPs include the following:

1. Sequence number given to each UOP to track the original logical program
order of the respective UOP. In one embodiment, the sequence number given to each
UOP within a particular thread is unique with respect to other UOPs within that
particular thread. The sequence number is used for the in-order retirement of the
UOPs once their executions are completed. The sequence number of each UOP is
also used in the event that the input UOPs are to be executed in-order.

2. Entry in the Free List Manager (FLM) 315 given to each UOP to allow the
rename history of the respective UOP to be tracked and recovered in case there is a
problem with the execution of a particular UOP and it needs to be re-executed.

3. Entry in the Reorder Buffer (ROB) 331 given to each UOP to allow the
respective UOP to be retired in-order once its execution is completed successfully and
the UOP is ready to be retired.

4. Entry in the physical register file given to each UOP to store the operand
data needed for the execution of the respective UOP and the result produced
therefrom.

5. Entry in the Load Buffer given to each UOP that needs to receive data from
the MEU 212 (also referred to as the memory execution cluster).

6. Entry in the Store Buffer given to each UOP that is to produce some data to
be stored in the memory execution cluster.

7. Entry in the IDQ 327 given to each UOP to hold the instruction information
before the respective UOP is dispatched by the SSU 341 to the execution unit 270.

8. Entry in the MIAQ 323 given to each memory UOP or entry in the GIAQ
325 given to each non-memory UOP to hold the time-critical information for the
respective UOP before it is dispatched by the SSU 341 to the execution unit 270.

In one embodiment, the ALF 311 is responsible for determining which
resources are required for the execution of an input UOP received from the TDE 230
and how much of the required resources need to be allocated for the respective UOP. For
example, the ALF 311, upon receiving a UOP from the TDE 230, will determine
whether a load buffer entry is needed and will allocate the appropriate load buffer
entry for the respective UOP if there is an entry available in the load buffer. If no
entry is available, the ALF 311 will generate a stall signal 282 as shown in Figure 2 to
inform the TDE 230 and other units within the processor that the incoming UOP
cannot be allocated and certain units within the processor, for example the TDE 230,
need to stall until the stall conditions are cleared. In one embodiment, the ALF 311
provides the appropriate allocation information (e.g. allocation pointers) to other units
within the RRRR cluster 240 including the IQ 321, the RAT 301, the ROB 331 and
other units outside of the RRRR cluster 240, for example the MEU 212 (Fig. 2).
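
The load buffer example amounts to an allocate-or-stall decision. The sketch below is a minimal illustration and makes one loud assumption that the text does not state: a group of incoming UOPs is treated as all-or-nothing, so if any needed entry is missing the whole group stalls. The function name and its parameters are hypothetical.

```python
def allocate_or_stall(needs_load_entry, free_entries):
    """Decide whether a group of incoming UOPs can get load buffer entries.

    needs_load_entry: one boolean per incoming UOP.
    free_entries: number of load buffer entries currently free.
    Returns (stall, entries_allocated).
    """
    needed = sum(1 for needs in needs_load_entry if needs)
    if needed > free_entries:
        # Not enough entries: assert the stall signal and allocate nothing.
        return True, 0
    return False, needed
```

In this model the caller would keep presenting the same UOPs until `stall` comes back false, which mirrors the TDE 230 stalling until the stall condition clears.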

In one embodiment, the ALF 311 uses certain information maintained by other
units such as a set of pointers referred to as tail pointers to determine, with respect to a
particular resource such as a load buffer, the number of free entries available for
allocation. The ALF 311 also receives other signals such as clear signals due to
branch misprediction (e.g., JEClear and CRClear) that are used to determine whether
to generate a stall signal.

In one embodiment, the microprocessor 200 can operate in either a single
thread (ST) mode or a multithread (MT) mode based upon a control input signal. In
one embodiment, the control input signal indicating whether the microprocessor 200
is to operate in ST or MT mode is provided by the operating system. As explained
above, the ALF unit 311, in the present embodiment, is responsible for allocating
most of the processor resources that are used for the execution of a particular UOP in
a particular thread. The various resources allocated by the ALF unit 311 include the
ROB 331, the FLM 315, the MIAQ 323, the GIAQ 325, the IDQ 327, the load buffer
(LB) (not shown), the store buffer (SB) (not shown), and the physical register file
entries that are required by the input UOPs. Each of the resources mentioned above
contains a predetermined number of resource elements or entries that are to be
allocated based upon the need of the respective UOPs and the availability of those
resource elements or entries. In one embodiment, for example, the ROB 331 contains
126 entries, the FLM 315 contains 126 entries, the IDQ 327 contains 126 entries, the
GIAQ 325 contains 32 entries, the MIAQ 323 contains 32 entries, the load buffer
contains 48 entries, the store buffer contains 24 entries, and the physical register file
contains 127 entries.

In the discussion that follows, it is assumed that there are two threads, thread 0
(T0) and thread 1 (T1), that can be executed concurrently by the microprocessor 200 in
MT mode or executed individually in ST mode. However, the teachings of the
present invention should not be limited to two threads and everything discussed herein
equally applies to a processing environment in which more than two threads are
executed concurrently. In addition, the discussion below is focused on the resource
computation and allocation performed by the ALF unit 311 with respect to one
exemplary queue, referred to hereinafter as Q, that is configured to operate as a
circular queue or buffer. However, the teachings of the present invention are equally
applicable to any other processor resource or any other data structure including, but
not limited to, a non-circular queue structure, a linked-list structure, an array structure,
a tree structure, etc.

In ST mode or ST configuration, each processor resource used in the execution
of the UOPs is allocated to the "working" thread, either thread 0 or thread 1. The
working thread is the particular thread to which the current set of UOPs received from
the TDE 230 belong with respect to the current processing period. In one
embodiment, the TDE 230 supplies as many as three valid UOPs per processing
clock cycle. All valid UOPs in each clock cycle are tagged with one thread bit
indicating the particular thread to which the respective allocation clock belongs. The
thread bit is used to identify which of the two threads is the current working thread. In
addition, the TDE 230 is responsible for supplying the correct valid bits for the set of
UOPs that the TDE 230 delivers to the RRRR cluster 240. Each UOP received from
the TDE 230 therefore is tagged with a valid bit indicating whether the respective
UOP is a valid UOP. When the TDE 230 has no valid UOPs to be allocated, the TDE
230 is responsible for driving the valid bits to invalid status. The UOPs within each
thread are delivered by the TDE 230 to the RRRR cluster 240 in their original
sequential program order.

In MT mode or MT configuration, each of the queues or buffers used for the
execution of the UOPs is partitioned into two portions: one portion is used for
thread 0 and the other portion is used for thread 1. In one embodiment, the two
portions are sized equally so that each thread is given the same number of queue or
buffer entries. In one embodiment, the physical registers are a common resource to be
shared by thread 0 and thread 1 on a first come, first served basis and there is no
partition of the physical registers between the two threads.

In one embodiment, the queues or buffers to be allocated are configured as
circular queues or circular buffers. Accordingly, once the end of a queue or a buffer is
reached, allocation for subsequent UOPs will wrap around and start at the beginning
of the queue or buffer. The wrap around operation with respect to a circular queue or
buffer is described in greater detail below in conjunction with the various operations
performed by the ALF 311 in doing resource computation and resource allocation for
each resource.

In one embodiment, the ALF 311 utilizes a separate set of pointers for each
thread with respect to each resource in order to perform the resource computation and
allocation for each thread. As such, there are two separate sets of pointers associated
with each resource. Each set of pointers includes a head pointer, a tail pointer, and a
stall pointer. The head pointer is used for the allocation of the queue entries. For
example, if the head pointer for a particular queue points to entry 1 in that queue, then
entry 1 is the entry to be allocated for the respective UOP. Once entry 1 is allocated,
the head pointer is advanced to the next entry in the queue, entry 2. The tail pointer is
used for the deallocation of queue entries. For example, if the tail pointer for a
particular queue points to entry 1 in that queue, then entry 1 is the entry to be freed
once the execution of the respective UOP is completed. Once entry 1 is deallocated or
freed up, the tail pointer is advanced to the next entry to be deallocated, entry 2. The
stall pointer is used to determine whether there are enough free queue entries to
accommodate the next allocation. For example, if the stall pointer for a particular
queue points to entry 3 and the incoming UOPs require three entries for their
allocation, then the stall pointer will point to entry 6 if there is enough room in the
queue to allocate three entries for the input UOPs. In one embodiment, the value of
the stall pointer is compared with the value of the tail pointer to determine whether
there is enough room for the required allocation.

In one embodiment, the allocation policy used for the ROB 331, FLM 315, and
IDQ 327 is a block allocation of three. Accordingly, if there is any valid UOP in the
set of three input UOPs then 3 entries in those queues will be used even if not all of
the input UOPs are valid. The allocation policy with respect to the GIAQ 325,
MIAQ 323, load buffer, and store buffer is based upon the actual requirements of the
input UOPs. Accordingly, entries in those queues will be allocated only if the input
UOPs require them.
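
The difference between the two allocation policies can be sketched as follows. This is only an illustrative sketch in Python, not the hardware logic of the ALF 311; the function names and the needs_entry flags are hypothetical and are introduced here for illustration.

```python
from typing import Sequence

BLOCK_SIZE = 3  # block allocation of three, as described for the ROB, FLM and IDQ


def entries_needed_block(valid_bits: Sequence[bool]) -> int:
    """Block policy: a full block is used if any UOP in the input set is valid."""
    return BLOCK_SIZE if any(valid_bits) else 0


def entries_needed_actual(valid_bits: Sequence[bool],
                          needs_entry: Sequence[bool]) -> int:
    """Actual-requirement policy: count only valid UOPs that need an entry.

    needs_entry is a hypothetical per-UOP flag saying whether that UOP
    requires an entry in the queue in question (e.g. a load buffer entry).
    """
    return sum(1 for v, n in zip(valid_bits, needs_entry) if v and n)
```

Under these assumptions, a set with a single valid UOP consumes three entries in a block-allocated queue, but at most one entry in a queue allocated by actual requirement.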

Figures 4a and 4b illustrate an example of a circular queue Q containing a
predetermined number of entries, for example 16, that is used for the allocation of the
working thread in ST mode and for the allocation of both threads 0 and 1 in MT
mode. In ST mode, depending upon which thread is the working thread, either the
thread 0 or thread 1 pointers are used for the resource computation and allocation with
respect to Q. In MT mode, two sets of pointers are used. Referring to Figure 4a, it is
assumed that thread 0 is the working thread in ST mode. The circular queue Q
contains 16 entries, entry 0 through entry 15. Since thread 0 is the working thread, the
set of pointers used for the allocation in this case is the thread 0 set of pointers:
T0_TAIL_PTR, T0_HEAD_PTR, and T0_STALL_PTR. Since the entire queue is
assigned to the working thread in ST mode, the end of queue in this case points to
entry 15 in the queue. Since the queue Q is circular, the T0_TAIL_PTR,
T0_HEAD_PTR, and T0_STALL_PTR will wrap around when they are advanced
past entry 15 in the queue. In ST mode, the allocation in the exemplary queue Q is
performed serially starting from entry 0. Since the pointers can wrap around, it is
necessary to keep track of the wrap around situation so that the resource computation
and resource allocation with respect to Q can be performed correctly. In one
embodiment, a wrap bit is used to keep track of the wrap around situation with respect
to each pointer. At the start of thread 0 execution or in response to a break event as
described above, the thread 0 pointers and their associated wrap bits for Q are
initialized to their appropriate values assuming that thread 0 is the working thread. In
one embodiment, T0_TAIL_PTR, T0_HEAD_PTR, and T0_STALL_PTR are
initialized to point to the first entry, i.e. entry 0, in the queue Q and their
corresponding wrap bits are set to 0, to indicate that the queue Q is empty at this stage.
As entries for the input UOPs are being allocated, the T0_HEAD_PTR is updated to
reflect the allocations made. Similarly, as entries in the queue are being freed up or
deallocated, the T0_TAIL_PTR is updated accordingly to reflect the deallocations
made. The T0_STALL_PTR is used to compute whether there are enough
entries in the queue to accommodate the input UOPs. The value of the wrap bit for
each pointer is toggled between 0 and 1 each time that particular pointer is advanced
past the end of queue, i.e. entry 15.

Figure 4b illustrates an exemplary queue Q containing 16 entries that are to be
allocated for thread 0 and thread 1 in MT mode. One portion of the queue, entries 0-7,
is reserved for thread 0 allocation while the other portion of the queue, entries 8-15, is
reserved for thread 1 allocation. Accordingly, thread 0 end of queue (T0_EOQ) points
to entry 7 and thread 1 end of queue (T1_EOQ) points to entry 15. One set of pointers
(T0_TAIL_PTR, T0_HEAD_PTR, T0_STALL_PTR) is used for resource
computation and allocation with respect to the portion reserved for thread 0 and
another set of pointers (T1_TAIL_PTR, T1_HEAD_PTR, T1_STALL_PTR) is used
for the resource computation and allocation with respect to the portion reserved for
thread 1. At the start of MT execution mode, the respective pointers for thread 0 and
thread 1 are initialized to their corresponding values based upon the partitioning
scheme implemented. In this example, since the queue is partitioned into two equal
portions, T0 pointers are initialized to point to the beginning of the queue, i.e. entry 0,
and the T1 pointers are initialized to point to the middle of the queue, i.e., entry 8.
The corresponding wrap bits for both thread 0 and thread 1 pointers are also initialized
accordingly, for example to 0. In this example, both portions of the queue Q are
configured to be circular. Accordingly, T0 pointers will wrap around to entry 0 as
they are advanced past entry 7. Similarly, T1 pointers will wrap around to entry 8 as
they are advanced past entry 15. A separate wrap bit is used to keep track of the wrap
around situation for each pointer of each thread with respect to its respective portion
of the queue. For example, the value of the wrap bit for each thread 0 pointer
(T0_WBIT) is toggled each time that particular thread 0 pointer is advanced past entry
7. Similarly, the value of the wrap bit for each thread 1 pointer (T1_WBIT) is toggled
each time that particular thread 1 pointer is advanced past entry 15. The values of
T0_TAIL_PTR, T0_STALL_PTR and the corresponding wrap bits for T0_TAIL_PTR
and T0_STALL_PTR are used to determine the number of entries available in the T0
portion of the queue for thread 0 allocation. Similarly, the values of T1_TAIL_PTR,
T1_STALL_PTR and the corresponding wrap bits for these two pointers are used to
determine the number of entries available in the T1 portion of the queue for thread 1
allocation.
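
One way to picture how a pointer and its wrap bit behave inside a partition, and how the free-entry count might be derived from the tail and stall pointers, is sketched below in Python. The text above does not give the exact formula, so the comparison in free_entries is an assumption (a common circular-buffer technique), and the names Pointer, advance and free_entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pointer:
    index: int          # entry number the pointer currently points to
    wrap_bit: int = 0   # toggled each time the pointer passes its end of queue


def advance(ptr: Pointer, count: int, begin: int, eoq: int) -> None:
    """Advance ptr by count entries within [begin, eoq], wrapping to begin."""
    size = eoq - begin + 1
    assert 0 <= count <= size
    offset = ptr.index - begin + count
    if offset >= size:          # pointer moved past the end of its portion
        offset -= size
        ptr.wrap_bit ^= 1       # record the wrap around
    ptr.index = begin + offset


def free_entries(tail: Pointer, stall: Pointer, begin: int, eoq: int) -> int:
    """Assumed rule: equal wrap bits mean stall is ahead of tail in the same lap;
    different wrap bits mean stall has wrapped once more than tail."""
    size = eoq - begin + 1
    if tail.wrap_bit == stall.wrap_bit:
        used = stall.index - tail.index
    else:
        used = size - (tail.index - stall.index)
    return size - used
```

For the thread 0 portion of Figure 4b (entries 0-7), a stall pointer advanced by 6 from entry 0 leaves 2 free entries; for the thread 1 portion (entries 8-15), a pointer at entry 14 advanced by 3 lands on entry 9 with its wrap bit toggled. The wrap bit is what lets the sketch tell a full portion from an empty one when the two pointers hold the same index.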

In the present embodiment, if the required resources are not available to
execute the input UOPs within a particular thread, the ALF unit 311 will generate a
stall signal with respect to that particular thread to inform the TDE 230 and other units
within the microprocessor that they need to stall until the stall conditions are cleared.
In one embodiment, stalling means that the ALF 311 and the RAT 301 will not get
any new valid UOPs to allocate and therefore no new valid UOPs will be transferred
to the rest of the processor down the pipeline. In addition, the TDE 230 needs to stop
fetching new UOPs because the last set of UOPs that were fetched cannot be allocated
due to the stall condition caused by insufficient resources. Since there are two
threads, thread 0 and thread 1, that can be executed concurrently, either thread 0 or
thread 1 or both threads can be stalled due to insufficient resources. Accordingly,
there are two separate stall signals, one for each thread, that can be activated by the
ALF 311 if there are insufficient resources to satisfy the resource requirements of the
input UOPs. The thread 0 stall signal, referred to as ALSTALLT0, is activated by the
ALF 311 if there are not enough resources to allocate the input UOPs in thread 0.
Likewise, the thread 1 stall signal, referred to as ALSTALLT1, is activated by the
ALF 311 if there are not enough resources to allocate the input UOPs in thread 1.

In one embodiment, the ALSTALL signals for both threads 0 and 1 are
determined in every clock if the processor is running in MT mode. In ST mode, there
is only one ALSTALL signal. It can be either ALSTALLT0 or ALSTALLT1 based
upon the working thread. In MT mode, when only ALSTALLT0 is asserted, the TDE
230 can drive one of the following to the RRRR cluster: (1) valid UOPs from thread
1; (2) invalid UOPs; or (3) the stalled UOPs from thread 0. Similarly, if only
ALSTALLT1 is asserted, the TDE 230 can drive to the RRRR cluster either valid
UOPs from thread 0, invalid UOPs, or the stalled UOPs from thread 1. When both
ALSTALLT0 and ALSTALLT1 are asserted, the TDE 230 can drive either the stalled
UOPs from thread 0, the stalled UOPs from thread 1, or invalid UOPs. In one
embodiment, the earliest that the ALF 311 will be able to allocate the stalled UOPs is
two clocks after the stall signal corresponding to that thread becomes inactive. In
order for the ALF 311 to allocate in two clocks, the TDE 230 needs to drive the
stalled UOPs on the last clock that the stall signal corresponding to that thread is still
active.
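
The three MT-mode cases listed above can be summarized as a small lookup, sketched here in Python. The function name and the string labels are hypothetical; the case where neither stall signal is asserted is not spelled out in the passage, so the result returned for it is an assumption.

```python
from typing import Set


def tde_drive_options(alstall_t0: bool, alstall_t1: bool) -> Set[str]:
    """Return what the TDE may drive to the RRRR cluster in MT mode."""
    options = {"invalid UOPs"}  # always permitted in the cases described
    # A stalled thread can only re-drive its stalled UOPs; a thread that is
    # not stalled can supply valid UOPs (assumed for the no-stall case).
    options.add("stalled thread 0 UOPs" if alstall_t0 else "valid thread 0 UOPs")
    options.add("stalled thread 1 UOPs" if alstall_t1 else "valid thread 1 UOPs")
    return options
```

With only ALSTALLT0 asserted, the sketch yields valid thread 1 UOPs, invalid UOPs, or the stalled thread 0 UOPs, matching case list (1) through (3) above.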

In one embodiment, to enable the stall computation, an additional clock is
added to the RRRR interface with the TDE. The first clock is used to perform the
stall computation and the allocation is done in the next clock if there is no stall. The
stall is computed for a set of three UOPs every medium clock. As mentioned above,
actual resource computation and allocation is done with respect to the GIAQ 325,
MIAQ 323, load buffer, and store buffer while block computation and allocation is
done with respect to the FLM 315, ROB 331, and the IDQ 327. As discussed above,
the stall signal for a particular thread will be activated if there are insufficient entries
in one or more of the resources to allocate for the input UOPs.

In one embodiment, there is a separate stall block computation for each thread.
In one embodiment, when the processor is running in MT mode, the stall computation
for thread 0 and the stall computation for thread 1 are performed in parallel in every
clock even though there is only one thread to be allocated in each clock.

Figure 5 illustrates a high level flow diagram of one embodiment of a method
500 for managing various resources in the multithreaded processor 200. In one
embodiment, a control signal indicating the corresponding execution mode is set to a
first value, for example 0, to indicate that the single threading mode is active and set
to a second value, for example 1, to indicate that the multithreading mode is active. In
one embodiment, the processor 200 waits for the state recovery to complete before it
can transition from one execution mode to another execution mode. Whenever the
ROB 331 detects an event condition, it asserts the signal CRNuke which causes all the
UOPs being processed but not yet retired or committed to be flushed.

With continuing reference to Figure 5, the method 500 starts at 501. At
decision block 505, the method 500 proceeds to block 509 if an event has been
detected. Otherwise it proceeds to block 541. In one embodiment, an event can be an
internal or external event or a condition detected by the ROB 331 which then
generates a CRNuke signal as described above. In one embodiment, one or more
signals are generated to indicate that state recovery is complete. One example of such
a signal is that after the state in the RAT 301 is recovered and all physical registers are
freed then a state recovery done signal is asserted. When all such state recovery done
signals are asserted then the state recovery is considered complete. At decision block
509, the method 500 proceeds to block 513 if the state recovery is completed. At
block 513, the thread active bits are latched. The method 500 then proceeds from
block 513 to decision block 517 to determine whether the processor is to run in MT or
ST mode. The method 500 then proceeds to block 521 if MT mode is indicated and to
block 531 if ST mode is indicated. At block 521, the allocation pointers for both
threads 0 and 1 are initialized according to a predetermined MT scheme. At block
531, the allocation pointers for the working thread are initialized according to a
predetermined ST scheme. The method then proceeds from either block 521 or block
531 to block 541 to perform the resource allocation task according to either the ST
scheme or MT scheme. The method then loops back from block 541 to block 505.
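
As a reading aid, one pass through the flow of Figure 5 can be traced with the Python sketch below, which returns the sequence of block numbers visited. The function name and its boolean parameters are hypothetical. The text does not say what happens at block 509 while state recovery is still pending; the sketch assumes the method simply waits there.

```python
from typing import List


def method_500_pass(event_detected: bool, recovery_done: bool,
                    mt_mode: bool) -> List[int]:
    """Trace one pass of method 500 as a list of visited block numbers."""
    path = [505]                      # decision: has an event been detected?
    if event_detected:
        path.append(509)              # decision: is state recovery complete?
        if not recovery_done:
            return path               # assumed: wait at 509 until recovery is done
        path.append(513)              # latch the thread active bits
        path.append(517)              # decision: MT or ST mode?
        path.append(521 if mt_mode else 531)  # initialize allocation pointers
    path.append(541)                  # perform the resource allocation task
    return path
```

A pass with no event goes straight from 505 to 541; a pass after an event in MT mode visits 505, 509, 513, 517, 521 and 541 before looping back to 505.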

Figure 6 shows a high level block diagram of a resource allocation process
600 performed at block 541 in Figure 5. The process starts at block 601 and proceeds
to block 605. At decision block 605, the process proceeds to block 611 to perform
resource allocation in MT mode if MT mode is indicated. Otherwise, it proceeds to
block 621 to perform resource allocation in ST mode. The process then proceeds to
end at block 691.

Figure 7 is a high level flow diagram of one embodiment of a resource
allocation process in MT mode 700 performed at block 611 in Figure 6. In this
embodiment, the allocation process is performed in parallel for both thread 0 and
thread 1. The process starts at block 701 and proceeds in parallel to both blocks 705
and 715 to perform stall computation for thread 0 and thread 1, respectively. The stall
computation for thread 0 and thread 1 will be discussed in more detail below. The
process then proceeds in parallel from blocks 705 and 715 to blocks 707 and 717,
respectively. At decision block 707, the process proceeds to block 709 if T0 is not
stalled. Otherwise the process proceeds to end at block 791. At decision block 717,
the process proceeds to block 719 if T1 is not stalled. Otherwise the process proceeds
to end at block 791. At block 709, resource allocation for thread 0 is performed. At
block 719, resource allocation for thread 1 is performed. The resource allocation
performed at blocks 709 and 719 will be described in more detail below. The process
then proceeds to end at block 791.

Figure 8 illustrates a high level flow diagram of another embodiment of an
MT resource allocation process 800 performed at block 611 in Figure 6. In this
embodiment, both the stall computation and the resource allocation for each of the
two threads 0 and 1 are performed in a multiplexed manner based upon the thread ID
associated with the input UOPs received from the TDE 230. The process begins at
block 801 and proceeds to decision block 805. At decision block 805, the process
proceeds to block 811 if the UOPs received from the TDE come from thread 0.
Otherwise, the process proceeds to block 821. As described above, in one
embodiment, each UOP received from the TDE 230 is tagged with a thread bit
indicating the particular thread to which it belongs. In one embodiment where there
are two threads being executed concurrently in the MT mode, the thread bit is set to
one value, for example 0, to indicate that the respective UOP is in thread 0 and set to
another value, for example 1, to indicate that the respective UOP is in thread 1. At
block 811, stall computation for thread 0 is performed to determine whether there are
sufficient resources to execute the input UOPs from thread 0. The process then
proceeds from block 811 to decision block 815. At decision block 815, the process
proceeds to block 819 to perform resource allocation for the respective UOP if there
are sufficient resources available. Otherwise the process proceeds to end at block 891.
Referring back to block 821, stall computation for thread 1 is performed in this block
to determine whether there are sufficient resources to execute the input UOPs from
thread 1. The process then proceeds from block 821 to decision block 825. At
decision block 825, the process proceeds to block 829 to allocate the necessary
resources for the respective UOPs if there are sufficient resources available.
Otherwise, the process proceeds to end at block 891. The process then proceeds from
either block 819 or block 829 to end at block 891.

Figure 9 is a high level flow diagram of another embodiment of an MT
resource allocation process 900 performed at block 611 in Figure 6. In this
embodiment, the stall computations for thread 0 and thread 1 are performed in parallel
while the resource allocations for thread 0 and thread 1 are multiplexed. The process
starts at block 901 and proceeds to perform the stall computation for thread 0 and
thread 1 in parallel at block 905 and block 909, respectively. The process then
proceeds from blocks 905 and 909 to decision block 913 to determine whether the
input UOPs received in the current clock cycle belong to thread 0 or thread 1. As
described above, each input UOP received from the TDE 230 is tagged with a tag bit
indicating the corresponding thread to which the respective UOP belongs. The
process then proceeds from decision block 913 to block 915 if the respective UOP
belongs to thread 0, otherwise it proceeds to block 917. At decision block 915, the
process proceeds to block 921 to perform resource allocation for thread 0 if thread 0 is
not stalled. Otherwise, the process proceeds to end at block 991. At decision block
917, the process proceeds to block 931 to perform resource allocation for thread 1 if
thread 1 is not stalled. Otherwise the process proceeds to end at block 991. The
process then proceeds from either block 921 or 931 to end at block 991. In one
embodiment, both the resource computation and the resource allocation tasks are
performed in the same clock cycle. In another embodiment, the stall computation for
each thread is performed in one clock cycle while the resource computation for the
working thread is performed in the next clock cycle.

Figure 10 shows a high level flow diagram of one embodiment of an ST
resource allocation process 1000 performed at block 621 in Figure 6. As discussed
above, in the ST running mode, there is only one thread being executed and it is
considered the working thread. The process begins at block 1001 and proceeds to
decision block 1005. At decision block 1005, the process proceeds to block 1011 if
the working thread is thread 0 or to block 1021 if the working thread is thread 1. As
mentioned above, a thread active bit is maintained with respect to each thread being
executed to indicate whether that particular thread is active. In ST mode, either
thread 0 or thread 1 is the working thread. In one embodiment, a separate thread
active bit is maintained for each thread and is set to a first value to indicate that it is
active and set to a second value otherwise. At block 1011, stall computation is
performed with respect to thread 0 to determine whether there are sufficient resources
to execute the input UOPs received from the TDE 230. The process then proceeds
from block 1011 to decision block 1013. At decision block 1013, the process
proceeds to block 1015 to allocate the necessary resources for the respective thread 0
UOPs if the thread 0 stall signal is inactive. Otherwise, the process proceeds from
decision block 1013 to end at block 1091. Referring back to decision block 1005, the
process proceeds to block 1021 if the working thread is thread 1. At block 1021, stall
computation is performed with respect to thread 1 to determine whether there are
sufficient resources to execute the input thread 1 UOPs received from the TDE 230.
At decision block 1023, the process proceeds to block 1025 to allocate the necessary
resources for the execution of the input thread 1 UOPs if the thread 1 stall signal is
inactive. Otherwise, the process proceeds to end at block 1091. As discussed above,
in the ST mode, the resources are allocated to the active or working thread, either
thread 0 or thread 1. If it is determined that there are insufficient resources to execute
the input UOPs for the working thread fetched from the TDE 230, the ALF 311
generates the appropriate stall signal for the working thread, either thread 0 or thread
1, to inform the TDE 230 and other units within the microprocessor that the coming
UOPs cannot be executed. In this case, the TDE 230 needs to stall further fetching of
UOPs to the RRRR 300 until the conditions that cause the stall are cleared.

Figure 11 illustrates a more detailed flow diagram of one embodiment of the
MT parallel resource allocation process described in Figure 7 above. As discussed
above, in this embodiment, both the stall computation and resource allocation for
thread 0 and thread 1 are performed in parallel. The process begins at block 1101 and
proceeds in parallel to both blocks 1105 and 1155. At decision block 1105, the
process continues to block 1110 if the input UOPs for thread 0 are valid. Otherwise,
the process proceeds to end at block 1191 with respect to thread 0. At decision block
1155, the process proceeds to block 1160 if the input UOPs for thread 1 are valid.
Otherwise the process proceeds to end at block 1191 with respect to thread 1. As
discussed above, in one embodiment, each UOP received from the TDE 230 is
supplied with a valid bit indicating whether that particular UOP is valid. The TDE
230 is responsible for supplying the correct valid bits for the UOPs that it fetches to
the RRRR cluster 300. At blocks 1110 and 1160, the ALF unit 311 determines the
resources needed for the execution of thread 0 UOPs and thread 1 UOPs, respectively.
The process then proceeds from block 1110 to block 1115 and from block 1160 to
block 1165. At block 1115, the amount of resources available for thread 0 execution
is determined. At block 1165, the amount of resources available for thread 1
execution is determined. The process then continues from blocks 1115 and 1165 to
blocks 1120 and 1170, respectively. At decision block 1120, the process proceeds to
block 1125 to activate the stall signal for thread 0 if there are insufficient resources
available to execute the input UOPs from thread 0. Otherwise, the process proceeds
to block 1130 to allocate the required resources for the execution of the input thread 0
UOPs. The process then continues from block 1130 to block 1135 to update the
resource allocation pointers for thread 0 to keep track of the amount of resources
allocated in block 1130. The process then proceeds from either block 1125 or block
1135 to end at block 1191. Referring back to block 1165, the process proceeds from
block 1165 to block 1170. At decision block 1170, the process proceeds to block
1175 to activate the stall signal for thread 1 if there are insufficient resources to handle
the execution of thread 1 UOPs. Otherwise the process continues to block 1180 to
allocate the necessary resources for thread 1 UOPs. The process then proceeds from
block 1180 to block 1185 to update the resource allocation pointers for thread 1 to
keep track of the amount of resources allocated in block 1180. The process then
proceeds from either block 1175 or block 1185 to end at block 1191.
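
The per-thread path of Figure 11 can be sketched as a single function that is applied independently to each thread's portion. This Python sketch is an assumption-laden simplification: it models a portion as a plain entry counter rather than head, tail and stall pointers, and the names ThreadPortion and allocate_for_thread are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreadPortion:
    capacity: int        # entries reserved for this thread
    allocated: int = 0   # entries currently in use


def allocate_for_thread(portion: ThreadPortion, valid: bool, needed: int) -> bool:
    """Run one thread's path through Figure 11; return True if the stall
    signal for that thread is activated."""
    if not valid:                                       # blocks 1105 / 1155
        return False                                    # nothing to allocate
    available = portion.capacity - portion.allocated    # blocks 1115 / 1165
    if needed > available:                              # blocks 1120 / 1170
        return True                                     # blocks 1125 / 1175
    portion.allocated += needed                         # blocks 1130-1135 / 1180-1185
    return False
```

Because each call touches only its own ThreadPortion, the two calls (one per thread) do not depend on each other, which is the property that lets the hardware perform them in parallel.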



WO 01/48599 PCT/US00/32241

Figure 12 shows a more detailed flow diagram of one embodiment of the MT
resource allocation process described in Figure 8 above. In this embodiment, both the
resource computation and resource allocation for thread 0 and thread 1 are
multiplexed. The process starts at block 1201 and proceeds to decision block 1205.
At decision block 1205, the process continues to block 1209 if the input UOPs are
valid. Otherwise the process proceeds to end at block 1291. At decision block 1209,
the process proceeds to block 1213 to select the appropriate pointers for thread 0 if
thread 0 is the current working thread. Otherwise, the process proceeds to block 1217
to select the appropriate thread 1 pointers. The process then proceeds from either
block 1213 or block 1217 to block 1221 to determine the amount of resources
required to execute the input UOPs for the current working thread. The process then
continues to block 1225 to determine the amount of available resources for the
working thread using the appropriate pointers selected. At decision block 1229, the
process proceeds to block 1233 to activate the stall signal for the current working
thread, either thread 0 or thread 1, if there are not enough resources to handle the
execution of the input UOPs for the working thread. If there are enough resources
available, the process proceeds from decision block 1229 to block 1237 to allocate the
required resources for the current input UOPs. Resource allocation pointers for the
current working thread are then updated accordingly at block 1241 to keep track of the
amount of resources allocated to the working thread. The process then proceeds from
either block 1233 or block 1241 to end at block 1291.

Figure 13 illustrates a more detailed flow diagram of one embodiment of the
MT resource allocation process described in Figure 9 above. In this embodiment, the
resource stall computations for both thread 0 and thread 1 are done in parallel but the
resource allocation is only performed for the current working thread, i.e., resource
allocation for thread 0 and thread 1 is multiplexed.

The process begins at block 1301 and proceeds in parallel to both blocks 1305
and 1325. At decision block 1305, the process proceeds to end at block 1391 if the T0
input UOPs are invalid. Otherwise it proceeds to block 1309 to determine the
resources required for the execution of the T0 input UOPs. The process then
continues to block 1313 to determine the amount of resources available for thread 0
execution. At decision block 1317, the process proceeds to block 1321 to activate the
stall signal for thread 0 if there are insufficient resources to handle the execution of
the T0 input UOPs. Otherwise the process proceeds to decision block 1351.
Referring back to decision block 1325, the process proceeds to end at block 1391 if
the T1 input UOPs are invalid. Otherwise it proceeds to block 1329 to determine the
resources required for the execution of the T1 input UOPs. The process then
continues to block 1333 to determine the amount of resources available for thread 1
execution. At decision block 1337, the process proceeds to block 1341 to activate the
stall signal for thread 1 if there are insufficient resources to handle the execution of
the T1 input UOPs. Otherwise the process proceeds to decision block 1351. At
decision block 1351, the process proceeds to block 1355 to select the appropriate
thread 0 pointers if thread 0 is the current working thread, otherwise the process
proceeds to block 1359 to select the appropriate thread 1 pointers. The process then
continues to block 1361 from either block 1355 or block 1359. At decision block
1361, the process proceeds to end if the stall signal for the current working thread is
activated, otherwise it proceeds to allocate the necessary resources for the current
working thread, either thread 0 or thread 1, at block 1371. The process then proceeds
from block 1371 to block 1381 to update the appropriate allocation pointers for the
current working thread. The process then proceeds to end at block 1391.
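
The split between parallel stall computation and multiplexed allocation in Figure 13 can be sketched as follows. This is a hedged Python illustration, not the actual circuit: the function name hybrid_mt_step is hypothetical, and free entries are modelled as simple per-thread counters instead of pointer comparisons.

```python
from typing import Dict


def hybrid_mt_step(free: Dict[int, int], needed: Dict[int, int],
                   valid: Dict[int, bool], working: int) -> Dict[int, bool]:
    """One clock of the Figure 13 scheme for threads 0 and 1.

    Stall signals are computed for both threads; entries are allocated only
    for the current working thread, and only if that thread is not stalled.
    Returns the stall signal per thread and updates free in place.
    """
    # Stall computation for both threads (blocks 1305-1321 and 1325-1341).
    stall = {t: valid[t] and needed[t] > free[t] for t in (0, 1)}
    # Multiplexed allocation for the working thread (blocks 1351-1381).
    if valid[working] and not stall[working]:
        free[working] -= needed[working]
    return stall
```

In this sketch a stalled thread 0 does not prevent thread 1 from allocating when thread 1 is the working thread, because each stall signal is derived only from that thread's own portion.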

Figure 14 shows a flow diagram of one embodiment of the resource
computation and resource allocation process for thread 0 according to the teachings of
the present invention.

As described above, in order to perform the resource computation and resource 
allocation for each tturead, the ALF 311 utilizes and maintains a separate set of 
pointers for each respective thread. For each resource, the ALF 311 maintauis a set of 
pointers including a head pointer, a tail pointer, and a stall pointer for each thread to 
compute the amount of available entries in the resource and to allocate the appropriate 
entries in the resource as required for the execution of each UOP within each thread. 
Jr the discussion that follows, the process is discussed with respect to a particular 
queue, for example, the instruction queue as one of the resources needed for the 
execution of a particular UOP within thread 0 even though everything discussed 
herein is equally applicable to the resource computation and resource allocation with 
respect to other resources. As discussed above, in ST mode, each resource is wholly 
dedicated to serving the working thread, either thread 0 or thread 1. In MT mode, 
each resource is partitioned into two portions. One portion is reserved for thread 0 
while the other portion is reserved for thread 1. In one embodiment, the lower or first 
half of the queue is reserved for use by thread 0 and the upper or second half of the 
queue is reserved for use by thread 1. Accordingly, there are two sets of pointers that 
are used to perform the resource computation and resource allocation for each 
resource with respect to thread 0 and thread 1. In the present example, assuming that 
the size of the queue is Q, and the first entry in the queue is 0, then the end of the 
queue (EOQ) is designated as Q - 1 in ST mode. In MT mode, with respect to thread 
0 the beginning of queue is 0 and the end of queue is Q/2 - 1, whereas for thread 1 the 
beginning of queue is Q/2 and the end of queue is Q - 1. It should be understood and 
appreciated by one skilled in the art that the teachings of the present invention should 
not in any way be limited to the equal partitioning of the queue or resource. The 
teachings of the present invention are applicable to any other schemes or ways of 
resource partition (e.g., unequal partition). For example, a resource can be partitioned 
into two or more unequal portions based on various factors or criteria including, but 
not limited to, the number of threads being executed concurrently, the capacity of 
the resource, the relative processing priority of each thread, etc. As an example, a 
resource can be partitioned into two unequal portions in which 1/4 of the resource is 
reserved for one thread (e.g., Q/4) and 3/4 of the resource is reserved for another thread 
(3Q/4). 
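
The partitioning just described can be illustrated with a short sketch. It is not part of the disclosed apparatus: the function name partition_bounds and the shares argument are hypothetical, and giving any rounding remainder to the last portion is an assumption made only so that the portions always cover the whole queue.

```python
def partition_bounds(q, shares):
    """Return (begin, end) entry indices, inclusive, for each thread's portion.

    q      -- total number of entries in the queue (Q in the text)
    shares -- relative weight of each thread's portion, e.g. [1, 1] or [1, 3]
    """
    total = sum(shares)
    bounds = []
    begin = 0
    for i, share in enumerate(shares):
        if i == len(shares) - 1:
            # Assumption: the last portion absorbs any rounding remainder
            # so the portions always cover entries 0 .. q-1 exactly.
            size = q - begin
        else:
            size = q * share // total
        bounds.append((begin, begin + size - 1))
        begin += size
    return bounds


# ST mode: the working thread owns the whole queue, EOQ = Q - 1.
print(partition_bounds(16, [1]))      # [(0, 15)]
# MT mode, equal halves: thread 0 gets 0..Q/2-1, thread 1 gets Q/2..Q-1.
print(partition_bounds(16, [1, 1]))   # [(0, 7), (8, 15)]
# MT mode, unequal split: Q/4 for one thread and 3Q/4 for the other.
print(partition_bounds(16, [1, 3]))   # [(0, 3), (4, 15)]
```

With Q = 16 the equal split reproduces the boundaries given in the text: thread 0 ends at Q/2 - 1 = 7 and thread 1 begins at Q/2 = 8.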

Continuing with the present discussion with respect to the thread 0 resource 
computation and allocation process, it should be noted, as explained above, that after 
the CRNuke signal is asserted and the Nuke event is done with respect to thread 0 and 
thread 1, the associated pointers for each resource are initialized to the appropriate 
values depending on whether the processor is to run in ST mode or MT mode. In ST 
mode, either the set of pointers for thread 0 or the set of pointers for thread 1 is 
initialized depending on whether thread 0 or thread 1 is the working thread. For 
example, if thread 0 is the working thread in ST mode, then T0_HEAD_PTR, 
T0_TAIL_PTR, and T0_STALL_PTR are initialized to 0 and the end of queue value 
is Q - 1 where Q is the size of the particular resource to be allocated for the execution 
of thread 0 UOPs. Similarly, when thread 1 is the working thread in ST mode, 
T1_HEAD_PTR, T1_TAIL_PTR, and T1_STALL_PTR are initialized to 0 and the 
end of queue value with respect to thread 1 is also Q - 1. In ST mode, the whole 
resource is to be reserved for the working thread. In MT mode, however, the queue or 
the resource is partitioned into two equal portions and the sets of pointers for thread 0 
and thread 1 are set to the appropriate corresponding values. For example, in MT 
mode, T0_HEAD_PTR, T0_TAIL_PTR, and T0_STALL_PTR are set to 0 after 
NUKE whereas T1_HEAD_PTR, T1_TAIL_PTR, and T1_STALL_PTR are set to 
Q/2 after NUKE. The end of queue with respect to thread 0 in MT mode is Q/2 - 1 
while the end of queue with respect to thread 1 is Q - 1. By setting the pointers for 
thread 0 and thread 1 to their corresponding values as described above, the resource or 
queue to be allocated is partitioned into two equal portions. In one embodiment, each 
queue or buffer to be allocated is configured as a circular queue or circular buffer. As 
such, when the pointers for each thread with respect to a particular queue or buffer are 
advanced past their respective end of queue, these pointers are wrapped around. A 
wrap bit is used to keep track of the wrap around situation with respect to each pointer 
of each thread. The wrap bit for each pointer of each thread is set to a first value at 
the start to indicate that the corresponding pointer has not been wrapped around. The 
value of the wrap bit for each pointer is toggled when that particular pointer is 
advanced past its corresponding end of queue. For example, if T0_STALL_PTR or 
T0_TAIL_PTR for a particular queue is advanced past Q - 1 in ST mode or Q/2 - 1 in 
MT mode, then the wrap bit for T0_STALL_PTR or the wrap bit for T0_TAIL_PTR 
with respect to that particular queue is toggled. This wrap bit for each pointer is used 
in the stall computation for each resource, as described in more detail below. Again, it 
should be understood and appreciated by one skilled in the art that the teachings of the 
present invention should not in any way be limited to the equal partitioning of the 
queue or resource. The teachings of the present invention are applicable to any other 
schemes or ways of resource partition (e.g., unequal partition). For example, a 
resource can be partitioned into two or more unequal portions based on various factors 
or criteria including, but not limited to, the number of threads being executed 
concurrently, the capacity of the resource, the relative processing priority of each 
thread, etc. As an example, a resource can be partitioned into two unequal portions in 
which 1/4 of the resource is reserved for one thread (e.g., Q/4) and 3/4 of the resource is 
reserved for another thread (3Q/4). 
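
As an illustration only, the pointer initialization after the Nuke event and the wrap bit bookkeeping might be modeled as follows. The class and function names are hypothetical. The sketch assumes equal partitioning in MT mode, and it assumes that a pointer which moves one entry past its end of queue lands on the first entry of its portion.

```python
from dataclasses import dataclass


@dataclass
class Pointer:
    value: int = 0
    wrap: int = 0   # first value (0) means the pointer has not wrapped yet


@dataclass
class ThreadPointers:
    begin: int      # first entry of this thread's portion
    eoq: int        # end of queue (last entry) for this thread
    head: Pointer
    tail: Pointer
    stall: Pointer


def init_after_nuke(q, mt_mode, thread):
    """Initialize one thread's pointers once the Nuke event is done."""
    if mt_mode and thread == 1:
        begin, eoq = q // 2, q - 1      # upper half reserved for thread 1
    elif mt_mode:
        begin, eoq = 0, q // 2 - 1      # lower half reserved for thread 0
    else:
        begin, eoq = 0, q - 1           # ST mode: working thread owns all Q entries
    return ThreadPointers(begin, eoq, Pointer(begin), Pointer(begin), Pointer(begin))


def advance(ptr, count, begin, eoq):
    """Advance a pointer by count entries inside the circular range [begin, eoq]."""
    size = eoq - begin + 1
    new_value = ptr.value + count
    if new_value > eoq:
        new_value -= size   # assumption: one past EOQ maps back to begin
        ptr.wrap ^= 1       # toggle the wrap bit on every wrap-around
    ptr.value = new_value


t1 = init_after_nuke(16, mt_mode=True, thread=1)
advance(t1.stall, 3, t1.begin, t1.eoq)
print(t1.stall)   # Pointer(value=11, wrap=0)
advance(t1.stall, 5, t1.begin, t1.eoq)
print(t1.stall)   # Pointer(value=8, wrap=1)
```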

Referring back to Figure 14, the process begins at block 1401 and proceeds to 
block 1405 to set the T0_PREV_STALL_PTR to be equal to the current 
T0_STALL_PTR. At decision block 1409, the process proceeds to block 1413 to 
select Q/2 - 1 as the end of queue value if the processor is running in MT mode. 
Otherwise, the process proceeds to block 1417 to select Q - 1 as the end of queue 
value. The process then continues from either block 1413 or block 1417 to block 
1421 to compute the number of entries needed for this set of input UOPs. The process 
proceeds to block 1425 to compute the new value for the T0_STALL_PTR. In one 
embodiment, the T0_STALL_PTR is incremented by the number of entries computed 
in block 1421 to obtain the new value for T0_STALL_PTR. For example, 
T0_STALL_PTR = T0_STALL_PTR + R_CNT where R_CNT is the number of 
entries needed that is computed in block 1421. At decision block 1433, the process 
proceeds to block 1437 to wrap around the new T0_STALL_PTR and toggle the 
corresponding wrap bit if it advances past the respective EOQ. Otherwise the process 
continues to block 1439. As discussed above, since the queue to be allocated here is 
configured as a circular queue, once the T0_STALL_PTR advances past the EOQ it 
needs to be wrapped around and the corresponding wrap bit needs to be toggled 
accordingly. For example, if the T0_STALL_PTR = EOQ then the T0_STALL_PTR 
is wrapped around to 0, which is the start of the respective portion of the queue 
reserved for thread 0. If the T0_STALL_PTR = EOQ + 1, then the T0_STALL_PTR 
is wrapped around to 1, which is the start of the respective portion of the queue plus 1, 
and so on. The process then proceeds from block 1437 to block 1439. At block 1439, 
the T0_STALL_PTR is compared with T0_TAIL_PTR, taking into consideration the 
values of the wrap bits associated with T0_STALL_PTR and T0_TAIL_PTR, to 
determine whether there are enough free entries in the queue to allocate the entries 
required. In one embodiment, if the wrap bit for T0_STALL_PTR is 1, the wrap bit 
for T0_TAIL_PTR is 0, and the T0_STALL_PTR is greater than the T0_TAIL_PTR 
then there is not enough room in the queue to allocate the required entries for thread 0. 
If there is not enough room to allocate the required entries or if T0_CLEAR is 
activated then the process proceeds to block 1447 to activate the stall signal for thread 
0, T0_STALL (also referred to as ALstallT0). Otherwise the process proceeds to 
block 1443 to deactivate the stall signal for thread 0. The process then proceeds from 
either block 1443 or block 1447 to block 1451. At decision block 1451, if 
T0_STALL is not active then the process proceeds to block 1455 to allocate the 
required entries in the queue and update the T0_HEAD_PTR to reflect the allocation 
made. Otherwise the process proceeds to block 1459 to restore the previous value of 
the T0_STALL_PTR in preparation for the next round of resource computation and 
allocation for thread 0. The process then proceeds from either block 1455 or block 
1459 to end at block 1491. 
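
The flow of Figure 14 can be sketched in software form as follows. This is a hedged illustration rather than the disclosed circuit. The function name and the dictionary layout are hypothetical. Several details are assumptions: the wrap arithmetic follows the Figure 17 description (a pointer one entry past the EOQ maps to the first entry of the portion), the room check is generalized to "the wrap bits differ and the stall pointer is greater than the tail pointer", the wrap bit is restored together with the stall pointer, and the head pointer is simply set to the new stall pointer value after a successful allocation.

```python
def thread0_allocate_step(state, r_cnt, q, mt_mode, t0_clear=False):
    """One pass of the thread 0 resource computation and allocation.

    state is a dict with keys 'stall', 'stall_wrap', 'tail', 'tail_wrap', 'head'.
    Returns True when T0_STALL is active, i.e. no allocation was made.
    """
    prev_stall, prev_wrap = state["stall"], state["stall_wrap"]   # block 1405
    eoq = q // 2 - 1 if mt_mode else q - 1                         # blocks 1409-1417
    size = eoq + 1            # thread 0's portion starts at entry 0
    new_stall = state["stall"] + r_cnt                             # block 1425
    new_wrap = state["stall_wrap"]
    if new_stall > eoq:                                            # blocks 1433-1437
        new_stall -= size
        new_wrap ^= 1
    # Block 1439: no room if the stall pointer has lapped the tail pointer.
    no_room = new_wrap != state["tail_wrap"] and new_stall > state["tail"]
    t0_stall = no_room or t0_clear                                 # blocks 1443/1447
    if not t0_stall:                                               # block 1455
        state["stall"], state["stall_wrap"] = new_stall, new_wrap
        state["head"] = new_stall   # assumption: head tracks the allocation point
    else:                                                          # block 1459
        state["stall"], state["stall_wrap"] = prev_stall, prev_wrap
    return t0_stall


state = {"stall": 6, "stall_wrap": 0, "tail": 2, "tail_wrap": 0, "head": 6}
print(thread0_allocate_step(state, 3, 16, True), state["stall"])   # False 1
```

In the usage line, Q = 16 in MT mode gives thread 0 entries 0 to 7. Three entries starting at entry 6 wrap to entries 6, 7 and 0, which are free because the tail pointer is at entry 2, so no stall is raised.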

It should be noted that while the process is described in a sequential manner, 
many of the tasks performed by the process do not need to be done sequentially and 
can be done in parallel or in a different order provided that there are no logical 
dependencies between those tasks. 

Figure 15 shows a flow diagram of one embodiment of the resource 
computation and resource allocation process for thread 1 according to the teachings of 
the present invention. 

As explained above, after the CRNuke signal is asserted and the Nuke event is 
done with respect to thread 0 and thread 1, the associated pointers for each resource 
are initialized to the appropriate values depending on whether the processor is to run 
in ST mode or MT mode. 

The process begins at block 1501 and proceeds to block 1505 to set the 
T1_PREV_STALL_PTR to be equal to the current T1_STALL_PTR. The process 
then proceeds to block 1509 to select Q - 1 as the end of queue value for thread 1. 
The process then continues to block 1521 to compute the number of entries needed for 
this set of input UOPs. The process proceeds to block 1525 to compute the new value 
for the T1_STALL_PTR. In one embodiment, the T1_STALL_PTR is incremented 
by the number of entries computed in block 1521 to obtain the new value for 
T1_STALL_PTR. For example, T1_STALL_PTR = T1_STALL_PTR + R_CNT 
where R_CNT is the number of entries needed that is computed in block 1521. At 
decision block 1533, the process proceeds to block 1537 if T1_STALL_PTR 
advances past the corresponding EOQ. Otherwise the process continues to block 
1539. Since the queue to be allocated here is configured as a circular queue, once the 
T1_STALL_PTR advances past the EOQ it needs to be wrapped around and its 
corresponding wrap bit needs to be toggled accordingly. For example, if the 
T1_STALL_PTR = EOQ then the T1_STALL_PTR is wrapped around to Q/2, which 
is the start of the corresponding portion of the queue reserved for thread 1. If the 
T1_STALL_PTR = EOQ + 1, then the T1_STALL_PTR is wrapped around to Q/2 + 
1, which is the start of the corresponding portion of the queue plus 1, and so on. The 
process then proceeds from block 1537 to block 1539. At block 1539, the 
T1_STALL_PTR is compared with T1_TAIL_PTR, taking into consideration the 
values of their corresponding wrap bits, to determine whether there are enough free 
entries in the queue to allocate the entries required. In one embodiment, if the wrap 
bit for T1_STALL_PTR is 1, the wrap bit for T1_TAIL_PTR is 0 and the 
T1_STALL_PTR is greater than the T1_TAIL_PTR then there is not enough room in 
the queue to allocate the required entries for thread 1. If there is not enough room in 
the queue to allocate the required entries or if the T1_CLEAR signal is activated, the 
process proceeds to block 1547 to activate the stall signal for thread 1, T1_STALL 
(also referred to as ALstallT1). Otherwise the process proceeds to block 1543 to 
deactivate the stall signal for thread 1. The process then proceeds from either block 
1543 or block 1547 to block 1551. At decision block 1551, if T1_STALL is not 
active then the process proceeds to block 1555 to allocate the required entries in the 
queue and update the T1_HEAD_PTR to reflect the allocation made. Otherwise the 
process proceeds to block 1559 to restore the previous value of the T1_STALL_PTR 
in preparation for the next round of resource computation and allocation for thread 1. 
The process then proceeds from either block 1555 or block 1559 to end at block 1591. 

It should be noted that while the process is described in a sequential manner, 
many of the tasks performed by the process do not need to be done sequentially and 
can be done in parallel or in a different order provided that there are no logical 
dependencies between those tasks. 

Figure 16 shows a block diagram of one embodiment of an apparatus for 
performing the stall computation for thread 0 and thread 1. In this embodiment, the 
stall computation for both thread 0 and thread 1 is done in parallel in every clock 
cycle even though resource allocation is performed for one thread at a time. In the 
discussion that follows, thread 0 will also be referred to as the blue thread and thread 
1 will also be referred to as the red thread. Accordingly, various operations or 
pointers associated with thread 0 will also be referred to as the "blue" operations or 
"blue" pointers, for example T0_STALL_PTR will also be referred to as the 
BLUESTALLPTR. Similarly, various operations or pointers associated with thread 1 
will also be referred to as the "red" operations or "red" pointers, for example 
T1_STALL_PTR will also be referred to as the REDSTALLPTR. In addition, the 
discussion below will focus on the operations and computations with respect to thread 
0 (the blue thread) although everything discussed herein equally applies to the other 
thread (thread 1 or the red thread). 

In one embodiment, the stall computation unit shown in Figure 16 can contain 
several logical blocks that operate together to perform the stall computation with 
respect to a particular thread, e.g., thread 0, and to activate the appropriate stall signal 
if certain conditions are satisfied. These logical blocks include: a first block that 
performs UOP decoding and counting to determine the number of entries required in a 
particular resource to execute the input UOPs; a second block that computes the 
available entries in the resource; a third block that contains stall conditions related to 
CRClear or CRNuke conditions that are driven from state machines; and a fourth block 
that performs stall pointer computation that needs to be evaluated and used in the stall 
computation in the next clock. A detailed description of the fourth block is provided 
below with respect to Figure 17. In this embodiment, as mentioned above, the stall 
computation for both thread 0 and thread 1 is done in parallel in every clock cycle 
even though resource allocation is performed for one thread at a time. 

Referring again to Figure 16, a set of three input UOPs and their 
corresponding valid bits 1607 are inputted into a basic decode logic 1613. The basic 
decode logic 1613 decodes the input UOPs and provides the decoded information to a 
counting logic 1617 that counts the number of entries required based upon the types of 
the input UOPs. The outputs from the counting logic 1617 are then latched by the 
latching device 1621 which provides the appropriate select signals to the selector 1637 
based upon the number of entries required as determined by the counting logic 1617. 
As shown in Figure 16, the three inputs C0, C1, and C2 of the latching device 1621 
are set as follows by the counting logic 1617: C0 is set to 1 if there is exactly one 
entry required; C1 is set to 1 if there are exactly two entries required; and C2 is set to 
1 if there are exactly three entries required. 
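
The decode and counting step can be pictured with the following sketch. It is illustrative only. The text does not say which UOP types occupy an entry in a given resource, so the type names and the rule that each valid UOP of a matching type needs exactly one entry are assumptions, as are the function names.

```python
def entries_required(uop_types, valid_bits, uses_resource):
    """Count the entries needed by up to three input UOPs.

    Assumption: each valid UOP whose type is in uses_resource needs exactly
    one entry. The type names used below are hypothetical.
    """
    return sum(1 for t, v in zip(uop_types, valid_bits) if v and t in uses_resource)


def count_select(count):
    """Return (C0, C1, C2): C0 for one entry, C1 for two, C2 for three."""
    if not 0 <= count <= 3:
        raise ValueError("three input UOPs need between 0 and 3 entries")
    return (int(count == 1), int(count == 2), int(count == 3))


# Hypothetical example: a buffer that only "load" UOPs occupy.
n = entries_required(["load", "add", "load"], [1, 1, 1], {"load"})
print(n, count_select(n))   # 2 (0, 1, 0)
```

At most one of C0, C1 and C2 is set at a time, and all three are low when no entry is required.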

Referring again to Figure 16, the second block, also referred to as the resource 
availability block, computes the number of available entries in the resource as follows. 
Since the three input UOPs 1607 may require up to three entries in the resource for 
their execution, there are three different scenarios that need to be considered. The 
first scenario is that the number of entries required by the input UOPs is one and the 
resource has at least one free entry to allocate, which is sufficient. The second 
scenario is that the number of entries required by the input UOPs is two and the 
resource has at least two free entries to allocate, which is sufficient. The third scenario 
is that the number of entries required is three and the resource has at least three free 
entries to allocate, which is sufficient. Accordingly, the subtract logic 1631, subtract 
logic 1633, and the subtract logic 1635 operate in parallel to compare three 
different values of the stall pointer with the value of the tail pointer, taking into 
consideration the values of the wrap bits associated with these two pointers, to 
determine the resource availability with respect to the three scenarios described above. 
The values of the wrap bits need to be considered because the resource is structured as 
a circular queue in this example, as described above. The subtract logic 1631 
compares the value of the current stall pointer plus one (StallPtr + 1) with the value of 
the tail pointer 1629, taking into consideration the values of the corresponding wrap 
bits. The output of the subtract logic 1631 is set to low if there is at least one free 
entry in the resource to allocate. If there are no free entries in the resource, then the 
output of the subtract logic 1631 is set to high. Similarly, the output of the subtract 
logic 1633 is set to low if there are at least two free entries in the resource to allocate 
and is set to high otherwise. Likewise, the output of the subtract logic 1635 is set to 
low if there are at least three free entries in the resource to allocate and set to high 
otherwise. The outputs from the subtract logic 1631, subtract logic 1633, and the 
subtract logic 1635 are then inputted into the selector 1637. The selector 1637 selects 
either the output from subtract logic 1631, the output from the subtract logic 1633, or 
the output from the subtract logic 1635, depending on the select input signals from the 
latching device 1621. If the input UOPs require only one entry, then C0 is set to high 
which causes the selector 1637 to select the output from the subtract logic 1631. If the 
input UOPs require only two entries, then C1 is set to high which causes the selector 
1637 to select the output from the subtract logic 1633. If the input UOPs require three 
entries, then C2 is set to high which causes the selector 1637 to select the output from 
the subtract logic 1635. If no entry is required by the input UOPs, then C0, C1, and 
C2 are all set to low which causes the selector 1637 to select a low value as its output. 
The selector 1637, based upon the select signals representing the number of entries 
required by the input UOPs and the signals representing the number of entries 
available in the resource, generates a corresponding signal indicating whether the 
resource has sufficient available entries. As an example, assume that the input 
UOPs require only one entry. In this case, the counting logic 1617 will set the C0 
input of the latching device 1621 which will cause the selector 1637 to select the 
output from the subtract logic 1631. In this case, there can be two possible outcomes 
depending on whether the resource has at least one free entry to allocate. As 
described above, if the resource has at least one free entry then the output of the 
subtract logic 1631 is set to low, otherwise it is set to high. If the output of the 
subtract logic 1631 is set to low, there is at least one free entry in the resource and that 
is sufficient because the input UOPs only require one entry in this example. If the 
output of the subtract logic 1631 is set to high, the resource is full and the required 
entry cannot be allocated in this example, which is a stall condition. As shown in 
Figure 16, the output of the selector is inputted into the OR gate 1651. The output of 
the selector 1637 represents a stall condition due to insufficient resource. 
Accordingly, if the output of the selector 1637 is set to high, the stall signal for thread 
0 will be activated. 
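
A software analogue of the three subtract logics and the selector 1637 is sketched below for illustration. The function names are hypothetical. The comparison rule is an assumption: a candidate stall pointer is treated as having run out of room when its wrap bit differs from the tail pointer's wrap bit and its value is greater than the tail pointer, and a pointer one entry past the end of queue is taken to map to the first entry of the portion.

```python
def not_enough_room(stall_ptr, stall_wrap, tail_ptr, tail_wrap, k, begin, eoq):
    """Model of one subtract logic: 1 (high) if fewer than k entries are free."""
    size = eoq - begin + 1
    candidate = stall_ptr + k
    wrap = stall_wrap
    if candidate > eoq:         # candidate ran past the end of the portion
        candidate -= size
        wrap ^= 1
    return 1 if (wrap != tail_wrap and candidate > tail_ptr) else 0


def resource_stall(c0, c1, c2, outputs):
    """Model of selector 1637; outputs = (out_1631, out_1633, out_1635)."""
    if c0:
        return outputs[0]
    if c1:
        return outputs[1]
    if c2:
        return outputs[2]
    return 0    # no entry required: the selector drives a low value


# Thread 0 portion is entries 0..7; entries 0..5 are occupied, 6 and 7 are free.
outs = tuple(not_enough_room(6, 0, 0, 0, k, 0, 7) for k in (1, 2, 3))
print(outs)                             # (0, 0, 1)
print(resource_stall(0, 0, 1, outs))    # 1
```

In the example only two entries are free, so a request for one or two entries is satisfied while a request for three entries (C2 set) produces a stall condition.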

As described above, the third block contains other stall conditions such as 
CRClear and CRNuke conditions. The signals representing CRClear conditions 1643, 
1645, and the CRNuke condition 1647 are also inputted into the OR gate 1651. In 
addition, the signals representing resource stall computations with respect to other 
resources are also inputted into the OR gate 1651. Accordingly, the stall signal for 
thread 0 will be activated if any one of the input signals to the OR gate 1651 is set. 
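
The combination performed by the OR gate 1651 amounts to the following, shown as a minimal sketch with hypothetical argument names.

```python
def thread0_stall(resource_stall, crclear_signals, crnuke, other_resource_stalls):
    """Model of OR gate 1651: the stall signal is active if any input is set."""
    inputs = [resource_stall, crnuke, *crclear_signals, *other_resource_stalls]
    return int(any(inputs))


print(thread0_stall(0, [0, 0], 0, [0, 0]))   # 0
print(thread0_stall(0, [0, 1], 0, [0, 0]))   # 1
```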

Figure 17 is a block diagram of one embodiment of an apparatus for updating 
the value of the stall pointer for thread 0 (the blue thread) based upon various 
conditions discussed below. Everything discussed herein is equally applicable to 
thread 1 (the red thread) stall pointer update functionality. In this embodiment, there 
are three stall pointers that are maintained for each queue: one stall pointer for thread 
0 (the blue stall pointer), one stall pointer for thread 1 (the red stall pointer), and one 
stall pointer for the working thread. In one embodiment, it is assumed in every clock 
that the machine would not stall and so the next stall pointer to be used for the stall 
computation would be set as if the allocation were done successfully. If stall is in fact 
activated, the stall pointer will be restored back to its previous value to reflect that no 
allocation is made in the last clock. 

As shown in Figure 17, the new value of the stall pointer for thread 0 (also 
referred to as T0_STALL_PTR or BlueStallPtr) 1791 can be set or updated to 
different values by the selector 1781 based upon the select signals 1777 and 1779. 
Basically, the value of the BlueStallPtr is updated based on three different scenarios 
according to the values of the select signals 1777 and 1779. In the first scenario, the 
selector 1781 will select, as the BlueStallPtr 1791, the output of the selector 1767 if 
the second select signal 1779 is set. The select signal 1779 is set according to the 
Nuke Done signal 1701. In this case, when the Nuke Done signal 1701 is set, the 
BlueStallPtr 1791 is initialized to its appropriate starting value, which is zero in either 
ST mode or MT mode. As described above, thread 0 pointers and thread 1 pointers 
are initialized to point to their corresponding portions of the queue based upon 
whether the current processing mode is ST or MT. For example, if the current 
processing mode is ST, the end of queue with respect to thread 0 (the blue thread) will 
be set to Q - 1 by the selector 1733. If the current processing mode is MT, the end of 
queue with respect to thread 0 (the blue thread) will be set to Q/2 - 1 by the selector 
1733. 

In the second scenario, the selector 1781 will select the value of the previous 
stall pointer for thread 0 (PrevStallPtr) 1711 as the value for the BlueStallPtr 1791 if 
the first select signal 1777 is set and the second select signal 1779 is not set. 
Therefore, in this scenario, the BlueStallPtr 1791 is restored back to its previous value 
if the Nuke Done signal is not set and any of the following stall conditions is present: 
Clear Blue, WaitFlash Blue, Stall Blue, Stall Blue + 1, etc. All of these different stall 
conditions are input to the OR gate 1723, the output of which is used as the select 
signal 1777. As explained above, once it is determined that the stall signal is active, 
the stall pointer needs to be restored back to its previous stall value to reflect the fact 
that no allocation was made in the last clock. 

In the third scenario, the selector 1781 will select the output of the selector 
1771 as the new value for the BlueStallPtr 1791 if both select signals 1777 and 1779 
are not set, i.e., if there is no nuke event and none of the stall conditions for thread 0 is 
present. In this case, if the blue thread is the current working thread, then the 
BlueStallPtr 1791 will be incremented by a value corresponding to the number of 
required entries to be allocated, assuming that the queue has sufficient available 
entries to allocate the number of entries required by the input UOPs. Since the queue 
in this example is circular, the BlueStallPtr may be wrapped around if it advances past 
the end of its corresponding portion in the queue. Since the input UOPs may require 
from 0 to 3 entries in the queue for their execution, the StallPtr 1719 which represents 
the current value of the BlueStallPtr 1791 may be advanced by 0, 1, 2, or 3. To enable 
fast computation for the new value of the BlueStallPtr 1791, the four possible 
different values of the StallPtr 1719 are computed separately and compared against the 
appropriate value of the blue thread end of queue (EOQ) in parallel to determine 
whether wrapping around is needed. The selector 1737 will select either 0 or StallPtr 
+ 1 depending on whether StallPtr + 1 is greater than the EOQ. If the StallPtr + 1 is 
not greater than EOQ, then there is no wrap around. If the StallPtr + 1 is greater than 
EOQ, then it is wrapped around to point to 0, the beginning of the queue. Similarly, 
the selector 1739 will select either 0, 1, or StallPtr + 2 depending on whether StallPtr 
+ 2 is greater than the EOQ. If StallPtr + 2 is not greater than the EOQ, then there is 
no wrap around. If StallPtr + 2 goes past the EOQ by 1, then it is wrapped around to 
point to 0, the beginning of the queue. If the StallPtr + 2 goes past the queue by 2, 
then it is wrapped around to point to 1. Likewise, the selector 1741 will select either 
0, 1, 2, or StallPtr + 3 depending on whether the StallPtr + 3 is greater than the EOQ. 
The outputs of the selectors 1737, 1739, and 1741 are then inputted to the selector 
1771. The selector 1771 will select either the unchanged value of the StallPtr, the 
output of the selector 1737, the output of the selector 1739, or the output of the 
selector 1741, based upon the select signal provided by the latching device 1749. 
Accordingly, if the input UOPs 1721 require no entry in the queue, then the current 
value (unchanged) of the StallPtr 1719 will be selected as the new value for the 
BlueStallPtr 1791. If the input UOPs require 1 entry, then either StallPtr + 1 or its 
corresponding wrapped around value will be selected as the new value for 
BlueStallPtr 1791. If the input UOPs require 2 entries, then either StallPtr + 2 or its 
corresponding wrapped around value will be selected as the new value for the 
BlueStallPtr 1791. Finally, if the input UOPs require 3 entries, then either StallPtr + 3 
or its corresponding wrapped around value will be selected as the new value for the 
BlueStallPtr 1791. 
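
The three scenarios for updating the blue stall pointer can be mirrored in a few lines of code, given here as a sketch under stated assumptions. The function names are hypothetical. Hardware evaluates the four candidate values in parallel, whereas the sketch simply builds them in a list before the selection is made. The priority order (Nuke Done first, then any stall condition, then the count-based selection) follows the description above.

```python
def wrapped(value, eoq):
    """Wrap a thread 0 pointer that ran past the EOQ back toward entry 0."""
    return value - (eoq + 1) if value > eoq else value


def next_blue_stall_ptr(stall_ptr, prev_stall_ptr, r_cnt, eoq,
                        nuke_done, stall_conditions):
    """Return the new BlueStallPtr for one clock."""
    # Candidates for R_CNT = 0..3, built up front in the way selectors
    # 1737, 1739 and 1741 prepare their outputs before selector 1771 picks one.
    candidates = [wrapped(stall_ptr + k, eoq) for k in range(4)]
    if nuke_done:                   # first scenario: select signal 1779 set
        return 0
    if any(stall_conditions):       # second scenario: select signal 1777 set
        return prev_stall_ptr
    return candidates[r_cnt]        # third scenario: pick by number of entries


# Thread 0 in MT mode with Q = 16, so EOQ = 7 and StallPtr = 6.
print(next_blue_stall_ptr(6, 4, 3, 7, False, [False, False]))   # 1
print(next_blue_stall_ptr(6, 4, 3, 7, False, [True, False]))    # 4
print(next_blue_stall_ptr(6, 4, 3, 7, True, [False, False]))    # 0
```

With StallPtr = 6 and EOQ = 7 the four candidates are 6, 7, 0 and 1, which matches the text: StallPtr + 2 goes past the EOQ by 1 and wraps to 0, and StallPtr + 3 goes past by 2 and wraps to 1.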

In summary, the new value for the BlueStallPtr 1791 is updated as follows for 
the three different scenarios described above: 

Scenario 1: When NUKE DONE signal is asserted indicating that the Nuke event 
is done. 

BlueStallPtr = 0 

Scenario 2: When NUKE DONE is not asserted and at least one stall condition 
(e.g., stall due to insufficient resource, stall due to CRClear, etc.) is present with 
respect to the blue thread. 

BlueStallPtr = PrevStallPtr 

Scenario 3: When NUKE DONE is not asserted and there is no stall condition 
present with respect to the blue thread. 

If StallPtr + R_CNT is not greater than the EOQ for the blue thread then 
BlueStallPtr = StallPtr + R_CNT 

else 

BlueStallPtr = WrapAround(StallPtr + R_CNT) 
where R_CNT is the number of entries required by the input UOPs. 

Figure 18 shows a block diagram of one embodiment of an apparatus for 
updating the allocation pointer used to allocate the required entries for the working 
thread. In one embodiment, as described above, only one thread can allocate in any 
given clock cycle even though stall computations are performed for both threads in 
every clock cycle. Therefore, the head pointers associated with a particular thread can 
be advanced or updated only in the clock cycle in which resource allocation is 
performed for that particular thread. The head pointers for a particular thread, for 
example thread 0, will not be advanced in the clock cycle in which either the CRClear 
signal or any other stall condition is asserted with respect to that particular thread to 
reflect that no allocation is made in that clock cycle. On CRNuke, after the state in 
the RAT 301 has been recovered for both threads and all marbles for both threads 
have been freed, the head pointers will be updated to point back to the appropriate 
locations in the queue based on the processing mode after CRNuke is done. If the 
new processing mode or new configuration is ST, then either the head pointers for 
thread 0 or the head pointers for thread 1 will be updated to point to the beginning of 
the queues, depending on whether thread 0 or thread 1 is the working thread in the ST 
mode. If the new processing mode is MT, then the head pointers for both threads 0 
and 1 will be updated. In this case, the head pointers for thread 0 will be updated to 
point to the beginning of the queues whereas the head pointers for thread 1 will be 
updated to point to the middle of the queues. 

In one embodiment, there are three head pointers associated with each resource 
or queue: one head pointer for thread 0 (also referred to as T0_HEAD_PTR or 
BlueHead), one head pointer for thread 1 (also referred to as T1_HEAD_PTR or 
RedHead), and one head pointer to be used for the current working thread (referred to 
as HEAD_PTR or Head Ptr). In one embodiment, a thread bit indicating the current 
working thread will be used in addition to the working head pointer to select the 
appropriate entries in the queue to be allocated for the input UOPs. 

Referring to Figure 18, the new value for the Head Ptr 1891 can be set or 
updated to different values based on different scenarios according to the current 
allocation thread ID 1801 and the current stall thread ID 1851. The current allocation 
thread ID 1801 is used to indicate the particular thread being allocated in the current 
clock cycle, i.e., the working thread. The current stall thread ID 1851 is used to 
indicate the particular thread that will be allocated in the next clock cycle since stall 
computation is performed ahead of the allocation. The allocation thread ID 1801 and 
the stall thread ID 1851 are inputted to a mutual exclusion logic 1871 which generates 
the appropriate select signals for the selector 1881 according to the values of the 
allocation thread ID 1801 and the stall thread ID 1851. If both the allocation thread 
ID 1801 and the stall thread ID 1851 have the same values (e.g., both indicate blue or 
both indicate red), then the selector 1881 will select the output of the selector 1855 as 
the new value for the Head Ptr 1891. If the allocation thread ID 1801 is red and the 
stall thread ID 1851 is blue, then the value of the head pointer for the blue thread (i.e., 
the BlueHead) 1860 will be selected as the new Head Ptr 1891. If the allocation 
thread ID is blue and the stall thread ID is red, then the value of the head pointer for 
the red thread (i.e., the RedHead) 1865 will be selected as the new Head Ptr 1891. 
The selector 1855 selects, as its output, either the output of the selector 1840, the 
output of the selector 1845, or the current value of the Head Ptr 1891, based on the 
select signals that are generated by the AND gate 1830 and the OR gate 1835. The 
two inputs to the AND gate 1830 are the two signals indicating that Nuke is done for 
the blue thread and Nuke is done for the red thread. Thus, the output of the AND gate 
is only set if both of these signals are asserted, i.e., nuke is done for both threads. 
There are four inputs to the OR gate 1835. Accordingly, the output of the OR gate 
1835 is set if any of its four inputs is set. The four inputs to the OR gate 1835 
represent the different stall conditions for either the blue thread or the red thread 
which are selected accordingly by the selector 1825 based on the allocation thread ID 
1801. 

Referring again to the selector 1855, there are three different scenarios that can 
occur based on the select signals from the AND gate 1830 and the OR gate 1835. In 
the first scenario, if the output of the AND gate 1830 is set then the output of the 
selector 1845 is selected as the output of the selector 1855. In this case, the new value 
for the Head Ptr 1891 will be initialized to 0 or Q/2 depending on the current 
processing mode as indicated by the ST/MT signal 1822 and the allocation thread. In 
the second scenario, if the output of the AND gate 1830 is not set and the output of 
the OR gate 1835 is set, then the value of the Head Ptr 1891 is not updated. This is 
the case where the allocating thread is being stalled and therefore no allocation is 
made. Accordingly, the Head Ptr is not updated to reflect the fact that no allocation is 
made in the current clock cycle due to a stall condition. 

In the third scenario, if both the outputs of the AND gate 1830 and the OR gate 
1835 are not set, then the output of the selector 1840 will be selected as the new value 
for the Head Ptr 1891. This is the case where there is no nuke event and there is no 
stall condition present and therefore the current Head Ptr 1891 needs to be advanced 
by a corresponding value to reflect the number of entries allocated in this clock cycle. 
Since the queue in this example is circular, if the value of the Head Ptr 1891 plus the 
count value 1808 is greater than the end of queue then it will be wrapped around. The 
selector 1840 will select as its output either the output of the adder 1826 or its 
corresponding wrap around value based on the result of the comparison generated by 
the compare logic 1827. The output of the selector 1840 will then be selected as the 
new value for the Head Ptr 1891. 
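
By way of illustration only, the three scenarios described above may be modeled 
in software as follows. This is a minimal sketch and not the circuit of the 
described embodiment. The function name, the parameter names, the inclusive 
queue bounds, and the exact wrap around rule (the excess past the end of queue 
is carried over to the start of the queue) are assumptions introduced for this 
example; the initial value produced by the selector 1845 is simply passed in as 
init_value. 

```python
def next_head_ptr(head, count, nuke_done_blue, nuke_done_red,
                  stall_inputs, init_value, queue_start, queue_end):
    """Return the value loaded into the Head Ptr for the next clock cycle.

    queue_start and queue_end are hypothetical inclusive bounds of the
    queue (or of the portion being allocated). init_value stands in for
    the output of the selector 1845 (0 or Q/2 in the description).
    """
    and_out = nuke_done_blue and nuke_done_red   # AND gate 1830
    or_out = any(stall_inputs)                   # OR gate 1835 (four inputs)

    if and_out:
        # First scenario: nuke is done for both threads, re-initialize.
        return init_value
    if or_out:
        # Second scenario: the allocating thread is stalled, no update.
        return head

    # Third scenario: no nuke event and no stall, advance by the count.
    advanced = head + count                      # adder 1826
    if advanced > queue_end:                     # compare logic 1827
        # Assumed wrap around rule: carry the excess to the queue start.
        advanced = queue_start + (advanced - queue_end - 1)
    return advanced                              # selector 1840 output
```

For example, with an eight-entry queue spanning entries 0 to 7 and no nuke or 
stall condition, next_head_ptr(6, 3, False, False, [False] * 4, 0, 0, 7) 
evaluates to 1 under the assumed wrap around rule. 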

The invention has been described in conjunction with the preferred 
embodiment. It is evident that numerous alternatives, modifications, variations and 
uses will be appreciated by those skilled in the art in light of the foregoing description. 



WO 01/48599 PCT/US00/32241 



CLAIMS 

What is claimed is: 

1. A method of managing resources in a multithreaded processor, the method 
comprising: 

partitioning a resource into a number of portions based upon a number of 
threads being executed concurrently; and 

performing resource allocation for each thread in its respective portion of the 
resource. 

2. The method of claim 1 wherein partitioning comprises: 

sizing the corresponding portion for each thread according to a partitioning 
scheme; and 

marking the corresponding portion as being reserved for the respective thread. 

3. The method of claim 2 wherein the size of each portion is determined based 
upon at least one factor selected from the group consisting of a first factor indicating 
the number of threads being executed concurrently, a second factor indicating the 
capacity of the resource, and a third factor indicating a relative processing priority of 
each thread. 

4. The method of claim 2 wherein marking comprises: 

specifying the lower and upper boundaries of each portion corresponding to its 
respective location within the resource. 

5. The method of claim 1 further comprising: 

initializing each portion of the resource in response to one or more signals 
indicating a mode transition. 

6. The method of claim 5 wherein the mode transition is invoked in response to 
an event or condition. 

7. The method of claim 5 wherein initializing comprises: 

initializing a set of pointers corresponding to the respective portion. 

8. The method of claim 7 wherein the set of pointers comprises a first pointer 
used to keep track of entries that have been allocated in the respective portion and a 
second pointer used to keep track of entries that have been deallocated in the 
respective portion. 

9. The method of claim 1 wherein performing resource allocation for each thread 
comprises: 

performing stall computation for each thread to determine whether the 
respective portion has sufficient available entries to allocate a number of entries 
required for the execution of one or more instructions from the respective thread; and 

allocating the number of entries required in the respective portion if the 
respective portion has sufficient available entries. 

10. The method of claim 9 wherein performing stall computation for each thread is 
done in parallel with performing stall computation for another thread. 

11. The method of claim 9 wherein performing stall computation for each thread 
and performing stall computation for another thread are multiplexed. 

12. The method of claim 9 wherein allocating the number of entries required for 
each thread is done in parallel with allocating a number of entries required for another 
thread. 

13. The method of claim 9 wherein allocating the number of entries required for 
each thread and allocating a number of entries required for another thread are 
multiplexed. 

14. The method of claim 9 wherein performing stall computation for each thread 
comprises: 

determining the number of entries to be allocated for the one or more 
instructions from the respective thread; 

determining a number of entries available in the respective portion; and 

comparing the number of entries to be allocated with the number of entries 
available in the respective portion. 

15. The method of claim 14 further comprising: 

activating one or more stall signals if the number of entries required exceeds 
the number of entries available in the respective portion, the one or more stall signals 
indicating that the one or more instructions from the respective thread cannot be 
executed due to insufficient available resource in the respective portion. 

16. The method of claim 14 wherein determining the number of entries to be 
allocated for the one or more instructions comprises: 

determining the type of the one or more instructions; and 

determining whether the resource is needed to execute the one or more 
instructions based upon the type of the one or more instructions. 

17. The method of claim 16 wherein the number of entries to be allocated is 
greater than the number of entries needed to execute the one or more instructions. 

18. The method of claim 14 wherein determining the number of entries available 
comprises: 

comparing the value of the first pointer with the value of the second pointer to 
determine the number of entries that are available for allocation. 

19. The method of claim 18 further comprising: 

wrapping the first pointer when it is advanced past the end of the respective 
portion. 

20. The method of claim 19 including: 

updating a wrap bit to indicate that the first pointer is wrapped around. 

21. The method of claim 18 further comprising: 

wrapping the second pointer when it is advanced past the end of the respective 
portion. 

22. The method of claim 21 including: 

updating a wrap bit to indicate that the second pointer is wrapped around. 

23. A method of managing a resource in a multithreaded processor, the method 
comprising: 

detecting a signal indicating a processing mode; 

performing resource allocation according to a multithread scheme if the 
processing mode is multithreading; and 

performing resource allocation according to a single thread scheme if the 
processing mode is single threading. 

24. An apparatus for managing a resource in a multithreaded processor, the 
apparatus comprising: 

partition logic to partition the resource into a number of portions 
corresponding to a number of threads being executed concurrently; and 

resource control logic to perform resource allocation for each thread in its 
respective portion of the resource. 

25. An apparatus for controlling usage of a resource in a multithreaded processor, 
the apparatus comprising: 

detection logic to detect a signal indicating a processing mode; and 

a control circuit to perform resource allocation according to a single thread 
scheme if the processing mode is single threading and to perform resource allocation 
according to a multithread scheme if the processing mode is multithreading. 

26. A processor comprising: 

an instruction delivery engine to store and fetch instructions either from one or 
more threads based upon a current processing mode; and 

an allocator to receive instructions from the instruction delivery engine and to 
perform allocation in a resource based upon the current processing mode. 

27. An apparatus for managing a resource in a multithreaded processor, the 
apparatus comprising: 

means for assigning a portion of the resource to each of a plurality of threads 
being executed concurrently in the multithreaded processor; and 

means for performing resource allocation for each respective thread in its 
respective portion of the resource. 

28. An apparatus for controlling usage of a resource, the apparatus comprising: 

detection means for detecting a signal indicating a processing mode; and 

control means for performing resource allocation according to a single thread 
scheme if the processing mode is single threading and for performing resource allocation 
according to a multithread scheme if the processing mode is multithreading. 
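
For illustration only, the partitioning, stall computation, and pointer wrapping 
recited in claims 1, 4, 9, 14, 15 and 18 to 22 may be sketched in software as 
follows. This is a sketch under stated assumptions, not the claimed apparatus: 
the names Portion and partition_resource, the equal-size partitioning scheme, 
the use of pointer offsets relative to the lower boundary, and the toggling of 
a wrap bit on each wrap are all hypothetical choices made for this example. 

```python
from dataclasses import dataclass


@dataclass
class Portion:
    lower: int               # lower boundary within the resource
    upper: int               # upper boundary (inclusive)
    head: int = 0            # first pointer: offset of next entry to allocate
    tail: int = 0            # second pointer: offset of next entry to deallocate
    head_wrap: bool = False  # wrap bit for the first pointer
    tail_wrap: bool = False  # wrap bit for the second pointer

    @property
    def size(self):
        return self.upper - self.lower + 1

    def available(self):
        """Compare the two pointers to count entries free for allocation."""
        if self.head_wrap == self.tail_wrap:
            used = self.head - self.tail
        else:
            used = self.size - self.tail + self.head
        return self.size - used

    def allocate(self, required):
        """Stall computation, then allocation. False means a stall."""
        if required > self.available():
            return False
        self.head += required
        if self.head >= self.size:           # advanced past the end: wrap
            self.head -= self.size
            self.head_wrap = not self.head_wrap
        return True

    def deallocate(self, count):
        """Advance the second pointer; assumes count entries are in use."""
        self.tail += count
        if self.tail >= self.size:           # advanced past the end: wrap
            self.tail -= self.size
            self.tail_wrap = not self.tail_wrap


def partition_resource(capacity, num_threads):
    """Assumed scheme: split the resource equally among the threads."""
    size = capacity // num_threads
    return [Portion(lower=i * size, upper=(i + 1) * size - 1)
            for i in range(num_threads)]
```

With an eight-entry resource and two threads, partition_resource(8, 2) yields 
one portion spanning entries 0 to 3 and another spanning entries 4 to 7. A 
request for more entries than a thread's own portion has available returns 
False, corresponding to a stall, regardless of how many entries are free in the 
other thread's portion. 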